Web scrape Hacker News with Node.js
Challenge: scrape the Hacker News website and output a selected number of posts in the console with a `hackernews -p [num of posts]` command.
I tried a few different libraries while developing, and my latest solution uses:
- `node-fetch` for the fetch API,
- `cheerio` for HTML element selection,
- `commander` for running the program from the command line, allowing custom command flags such as `hackernews -p 5` and `hackernews --help`.
Quick setup: `mkdir <name>`, `cd <name>`, run `npm init -y` (skips the questions), `npm i` (installs node modules), then make a `server.js` file where we will write our code.
Also, create a `.gitignore` file with this:
# dependencies
/node_modules
This tells Git to ignore node modules when you’re pushing to GitHub.
PLAN:
Let’s break down our approach: what we want to do, and in what order.
- Scrape the website. Get the raw HTML. We will build one joined HTML string for all the pages we need, depending on the number of posts required.
- Extract the values we need from that HTML. Cheerio will do the work here.
- Validate the values. We need to ensure that the comments value is a number, the URI is valid, etc.
- Make it a CLI. Add a `hackernews -p [num of posts]` command to be run from the console, which makes it a CLI (command line interface). Here we will use `commander`, and we will add a few things in `package.json`.
Let’s get into it.
STEP 1: Scrape the website.
Add the dependencies:
npm i node-fetch
npm i cheerio
Then in `server.js` type:
const fetch = require('node-fetch')
const cheerio = require('cheerio')

const getPagesArray = (numberOfPosts) =>
  Array(Math.ceil(numberOfPosts / 30)) // divides by 30 (posts per page)
    .fill()                            // creates a new array
    .map((_, index) => index + 1)      // [1, 2, 3, 4, ...] pages array

const getPageHTML = (pageNumber) =>
  fetch(`https://news.ycombinator.com/news?p=${pageNumber}`)
    .then(resp => resp.text()) // Promise

const getAllHTML = async (numberOfPosts) => {
  return Promise.all(getPagesArray(numberOfPosts).map(getPageHTML))
    .then(htmls => console.log(htmls.join(''))) // one JOINED html
}

getAllHTML(5) // get all HTML for 5 posts
…and run `node server` in your console to see the HTML output for 5 posts.
In the `getPagesArray` function we calculate how many pages we need to fetch from to get the HTML for the `numberOfPosts` required. If we wanted 125 posts, then `Math.ceil(125 / 30) = 5`, and then `[1,2,3,4,5].map(getPageHTML)` fetches those five pages.
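To see the page math in isolation, you can run just this piece on its own (same logic as the function above, no dependencies needed):

```javascript
// Standalone check of the page calculation: Hacker News shows 30 posts per page.
const getPagesArray = (numberOfPosts) =>
  Array(Math.ceil(numberOfPosts / 30)) // how many pages we need
    .fill()                            // make the sparse array mappable
    .map((_, index) => index + 1)      // page numbers start at 1

console.log(getPagesArray(5))   // [ 1 ]
console.log(getPagesArray(125)) // [ 1, 2, 3, 4, 5 ]
```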
Step 1 complete. We have the raw html.
STEP 2: Extract the values we need.
Now we will use cheerio to get each post’s `title`, `uri`, `author`, `points`, `comments` and `rank` from the HTML we have. We will push each post object to the `results` array, but only for the number of posts we need.
const getPosts = (html, posts) => {
  let results = []
  let $ = cheerio.load(html)

  $('span.comhead').each(function() {
    let a = $(this).prev()

    let title = a.text()
    let uri = a.attr('href')
    let rank = a.parent().parent().text()

    let subtext = a.parent().parent().next().children('.subtext').children()
    let author = $(subtext).eq(1).text()
    let points = $(subtext).eq(0).text()
    let comments = $(subtext).eq(5).text()

    let obj = {
      title: title,
      uri: uri,
      author: author,
      points: points,
      comments: comments,
      rank: parseInt(rank)
    }
    if (obj.rank <= posts) {
      results.push(obj)
    }
  })
  if (results.length > 0) {
    console.log(results)
    return results
  }
}
Modify the `getAllHTML` function to call the `getPosts` function at the end:
const getAllHTML = async (numberOfPosts) => {
  return Promise.all(getPagesArray(numberOfPosts).map(getPageHTML))
    .then(htmls => getPosts(htmls.join(''), numberOfPosts))
}

getAllHTML(5)
…and run `node server` to see an array of 5 post objects printed in your console.
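Each entry in the results array looks roughly like this (the field values below are illustrative, not real Hacker News data):

```javascript
// Illustrative shape of one scraped post; every value here is made up.
const examplePost = {
  title: 'Example post title',
  uri: 'https://example.com/article',
  author: 'someuser',
  points: '128 points',    // still a raw string at this stage
  comments: '45 comments', // validated and parsed in the next step
  rank: 1                  // already parsed with parseInt
}

console.log(examplePost)
```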
Step 2 complete. We now have an array of objects with the data we want.
STEP 3: Validate the values.
Now we will build a few helper functions where we will validate our values before they make their way into our object.
//VALIDATIONS:

const checkInput = (input) => {
  if (input.length > 0 && input.length < 256) {
    return input
  } else {
    return input.substring(0, 25) + "..."
  }
}

const checkURI = (uri) => {
  let regex = /(^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)/
  if (regex.test(uri)) {
    return uri
  } else {
    return "uri not valid"
  }
}

const checkPoints = (points) => {
  if (parseInt(points) <= 0) {
    return 0
  } else {
    return parseInt(points)
  }
}

const checkComments = (comments) => {
  if (comments === 'discuss' || comments === '' || parseInt(comments) <= 0) {
    return 0
  } else {
    return parseInt(comments)
  }
}
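A quick sanity check of the two numeric helpers (the same logic as above, copied here so the snippet runs on its own):

```javascript
// Copies of the helpers above, so this snippet is self-contained.
const checkPoints = (points) => {
  if (parseInt(points) <= 0) {
    return 0
  } else {
    return parseInt(points)
  }
}

const checkComments = (comments) => {
  if (comments === 'discuss' || comments === '' || parseInt(comments) <= 0) {
    return 0
  } else {
    return parseInt(comments)
  }
}

console.log(checkPoints('128 points'))    // 128 — parseInt reads the leading digits
console.log(checkComments('45 comments')) // 45
console.log(checkComments('discuss'))     // 0 — posts with no comments show "discuss"
```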
Modify the `getPosts` function to call the validation functions when forming the `obj`:
let obj = {
title: checkInput(title),
uri: checkURI(uri),
author: checkInput(author),
points: checkPoints(points),
comments: checkComments(comments),
rank: parseInt(rank)
}
Step 3 complete. We added validation functions.
STEP 4: Let’s make it a CLI.
What we have now is great: we call `getAllHTML(25)` and we get 25 posts in the format we want. But we want to be able to call `hackernews -p 25` from the command line to get those 25 posts. This is where `commander` will help us.
Add the dependency with `npm i commander`, and require it in `server.js`:
const program = require('commander')
Write this ‘commander’ setup, which will call our `getAllHTML` and `getPosts` functions. Here we specify the flags `-p` and `--posts`; 30 is our default value.
program
  .option('-p, --posts [value]', 'Number of posts', 30)
  .action(args =>
    getAllHTML(args.posts)
      .then(html => getPosts(html, args.posts))
  )

program.parse(process.argv)
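If you’re curious what commander is doing for us, the flag lookup itself is not magic. A bare-bones, dependency-free sketch of just the `-p`/`--posts` handling might look like this (the `parsePostsFlag` function is hypothetical, not part of the tutorial’s code, and it skips everything else commander gives us, like `--help`):

```javascript
// Hand-rolled version of the -p / --posts flag with a default of 30.
const parsePostsFlag = (argv) => {
  const index = argv.findIndex(arg => arg === '-p' || arg === '--posts')
  if (index === -1 || index + 1 >= argv.length) {
    return 30 // default, same as commander's third argument to .option()
  }
  return parseInt(argv[index + 1])
}

console.log(parsePostsFlag(['node', 'server', '-p', '10'])) // 10
console.log(parsePostsFlag(['node', 'server']))             // 30
```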
Then modify the `getAllHTML` function to only return the HTML:
const getAllHTML = async (numberOfPosts) => {
return Promise.all(getPagesArray(numberOfPosts).map(getPageHTML))
.then(htmls => htmls.join(''))
}
One last thing to do in our `server.js` file is adding this Unix-style shebang line at the top:
#!/usr/bin/env node
It tells the system to execute this file with Node, which allows the file to be symlinked to a command we want to run from the command line. If we didn’t have this line and did everything specified below in this article, we would get a syntax error.
Now we could call `node server -p 10` to get 10 posts. But we want to use `hackernews` (or any other single word) instead of `node server`. So we go to the `package.json` file, where we can change a few things to make it happen.
1. We will supply a `bin` field in our `package.json`, which is a map of command name to local file name. `hackernews` is the command I chose to call instead of `node server`, and `./server.js` is my local script file to be run with this command. This format allows us to provide more than one script mapping if needed.
"dependencies": {
"cheerio": "^1.0.0-rc.2",
"commander": "^2.18.0",
"node-fetch": "^2.2.0"
},
"bin": {
"hackernews": "./server.js"
}
}
Then run:
npm link
And now run `hackernews -p 17` or `hackernews --posts 17`, or ask for any other number of posts you want, and see it working.
AMAZING.
You can always `npm unlink` and change the name of the command.
2. Now let’s configure a command that other people (who cloned your repo) can run to get all this working nicely. Add this line to the scripts section:
"scripts": {
"install-hackernews": "npm link && npm i -g",
"test": "echo \"Error: no test specified\" && exit 1"
},
They would need to run `npm run install-hackernews` to get all these other commands going (`npm link` plus an `npm i -g` global install, so that `hackernews -p 25` can be run from any other directory, such as Desktop). You can choose any name here:
"<chosen-name>": "npm link && npm i -g",
The `npm link` command allows us to locally ‘symlink a package folder’: it locally installs any command listed in the `bin` field of our `package.json`. In other words, `npm link` acts like a Node.js package installation simulator here.
Step 4 complete. We have a CLI here.
Quick experiment:
Let’s quickly check how this all works. Make a new file called `day.js` containing only this:
#!/usr/bin/env node

const days = ['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
const now = new Date()

console.log("Today is " + days[now.getDay()] + String.fromCodePoint(0x1f43c))
Then in package.json:
"bin": {
"hackernews": "./server.js",
"day": "./day.js"
}
Then run `npm link`, and then `day`. What do you see? What would you see if you cd back to Desktop and run `day`?
Thank you for coding-along! Here is the link to my repo: