Web scrape Hacker News with Node.js
Challenge: scrape the Hacker News website and output a selected number of posts in the console with a `hackernews -p [num of posts]` command.
I tried a few different libraries while developing, and my latest solution uses:
- `node-fetch` for the fetch API,
- `cheerio` for HTML element selection,
- `commander` for running the program from the command line, allowing custom command flags such as `hackernews -p 5` and `hackernews --help`.
Quick setup: `mkdir <name>`, `cd <name>`, run `npm init -y` (skips the questions), `npm i` (installs node modules), then make a `server.js` file where we will write our code.
Also, create a `.gitignore` file with this:
# dependencies
/node_modules
This tells Git to ignore node modules when you’re pushing to GitHub.
PLAN:
Let’s break down our approach: what we want to do, and in what order.
- Scrape the website. Get the raw HTML. We will build one joined HTML string for all the pages we need, depending on the number of posts required.
- Extract the values we need from that HTML. Cheerio will do the work here.
- Validate the values. We need to ensure that the comments value is a number, the URI is valid, etc.
- Make it a CLI. Add a `hackernews -p [num of posts]` command to be run from the console, which makes it a CLI (command line interface). Here we will use `commander`, and we will add a few things in `package.json`.
Let’s get into it.
STEP 1: Scrape the website.
Add the dependencies:
npm i node-fetch
npm i cheerio
Then in `server.js` type:
const fetch = require('node-fetch')
const cheerio = require('cheerio')

const getPagesArray = (numberOfPosts) =>
  Array(Math.ceil(numberOfPosts / 30)) // divides by 30 (posts per page)
    .fill()                            // creates a new array
    .map((_, index) => index + 1)      // [1, 2, 3, 4, ...] pages array

const getPageHTML = (pageNumber) =>
  fetch(`https://news.ycombinator.com/news?p=${pageNumber}`)
    .then(resp => resp.text()) // Promise

const getAllHTML = async (numberOfPosts) => {
  return Promise.all(getPagesArray(numberOfPosts).map(getPageHTML))
    .then(htmls => console.log(htmls.join(''))) // one JOINED html
}

getAllHTML(5) // get all HTML for 5 posts
…and run `node server` in your console to see the HTML output for 5 posts.
In the `getPagesArray` function we calculate how many pages we need to fetch from to get the HTML for the `numberOfPosts` required. If we wanted 125 posts, then `Math.ceil(125 / 30) = 5`, and then `[1,2,3,4,5].map(getPageHTML)` fetches those five pages.
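To see the page math in isolation, you can run just this piece on its own (same logic as the function above, no dependencies needed):

```javascript
// Standalone check of the page calculation: Hacker News shows 30 posts per page.
const getPagesArray = (numberOfPosts) =>
  Array(Math.ceil(numberOfPosts / 30)) // how many pages we need
    .fill()                            // make the sparse array mappable
    .map((_, index) => index + 1)      // page numbers start at 1

console.log(getPagesArray(5))   // [ 1 ]
console.log(getPagesArray(125)) // [ 1, 2, 3, 4, 5 ]
```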
Step 1 complete. We have the raw html.
STEP 2: Extract the values we need.
Now we will use cheerio to get each post’s `title`, `uri`, `author`, `points`, `comments` and `rank` from the HTML we have. We will push each post object to the `results` array, but only for the number of posts we need.
const getPosts = (html, posts) => {
  let results = []
  let $ = cheerio.load(html)

  $('span.comhead').each(function() {
    let a = $(this).prev()

    let title = a.text()
    let uri = a.attr('href')
    let rank = a.parent().parent().text()

    let subtext = a.parent().parent().next().children('.subtext').children()
    let author = $(subtext).eq(1).text()
    let points = $(subtext).eq(0).text()
    let comments = $(subtext).eq(5).text()

    let obj = {
      title: title,
      uri: uri,
      author: author,
      points: points,
      comments: comments,
      rank: parseInt(rank)
    }
    if (obj.rank <= posts) {
      results.push(obj)
    }
  })
  if (results.length > 0) {
    console.log(results)
    return results
  }
}
Modify the `getAllHTML` function to call the `getPosts` function at the end:
const getAllHTML = async (numberOfPosts) => {
  return Promise.all(getPagesArray(numberOfPosts).map(getPageHTML))
    .then(htmls => getPosts(htmls.join(''), numberOfPosts))
}

getAllHTML(5)
…and run `node server` to see an array of 5 post objects printed in your console.
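Each entry in the results array looks roughly like this (the field values below are illustrative, not real Hacker News data):

```javascript
// Illustrative shape of one scraped post; every value here is made up.
const examplePost = {
  title: 'Example post title',
  uri: 'https://example.com/article',
  author: 'someuser',
  points: '128 points',    // still a raw string at this stage
  comments: '45 comments', // validated and parsed in the next step
  rank: 1                  // already parsed with parseInt
}

console.log(examplePost)
```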
Step 2 complete. We now have an array of objects with the data we want.
STEP 3: Validate the values.
Now we will build a few helper functions where we will validate our values before they make their way into our object.
//VALIDATIONS:

const checkInput = (input) => {
  if (input.length > 0 && input.length < 256) {
    return input
  } else {
    return input.substring(0, 25) + "..."
  }
}

const checkURI = (uri) => {
  let regex = /(^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)/
  if (regex.test(uri)) {
    return uri
  } else {
    return "uri not valid"
  }
}

const checkPoints = (points) => {
  if (parseInt(points) <= 0) {
    return 0
  } else {
    return parseInt(points)
  }
}

const checkComments = (comments) => {
  if (comments === 'discuss' || comments === '' || parseInt(comments) <= 0) {
    return 0
  } else {
    return parseInt(comments)
  }
}
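A quick sanity check of the two numeric helpers (the same logic as above, copied here so the snippet runs on its own):

```javascript
// Copies of the helpers above, so this snippet is self-contained.
const checkPoints = (points) => {
  if (parseInt(points) <= 0) {
    return 0
  } else {
    return parseInt(points)
  }
}

const checkComments = (comments) => {
  if (comments === 'discuss' || comments === '' || parseInt(comments) <= 0) {
    return 0
  } else {
    return parseInt(comments)
  }
}

console.log(checkPoints('128 points'))    // 128 — parseInt reads the leading digits
console.log(checkComments('45 comments')) // 45
console.log(checkComments('discuss'))     // 0 — posts with no comments show "discuss"
```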
Modify the `getPosts` function to call the validation functions when forming the `obj`:
let obj = {
title: checkInput(title),
uri: checkURI(uri),
author: checkInput(author),
points: checkPoints(points),
comments: checkComments(comments),
rank: parseInt(rank)
}
Step 3 complete. We added validation functions.
STEP 4: Let’s make it a CLI.
What we have now is great: we call `getAllHTML(25)` and we get 25 posts in the format we want. But we want to be able to call `hackernews -p 25` from the command line to get those 25 posts. This is where `commander` will help us.
Add the dependency with `npm i commander`, and require it in `server.js`:
const program = require('commander')
Write this ‘commander’ setup, which will call our `getAllHTML` and `getPosts` functions. Here we specify the flags `-p` and `--posts`; 30 is our default value.
program
  .option('-p, --posts [value]', 'Number of posts', 30)
  .action(args =>
    getAllHTML(args.posts)
      .then(html => getPosts(html, args.posts))
  )

program.parse(process.argv)
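If you’re curious what commander is doing for us, the flag lookup itself is not magic. A bare-bones, dependency-free sketch of just the `-p`/`--posts` handling might look like this (the `parsePostsFlag` function is hypothetical, not part of the tutorial’s code, and it skips everything else commander gives us, like `--help`):

```javascript
// Hand-rolled version of the -p / --posts flag with a default of 30.
const parsePostsFlag = (argv) => {
  const index = argv.findIndex(arg => arg === '-p' || arg === '--posts')
  if (index === -1 || index + 1 >= argv.length) {
    return 30 // default, same as commander's third argument to .option()
  }
  return parseInt(argv[index + 1])
}

console.log(parsePostsFlag(['node', 'server', '-p', '10'])) // 10
console.log(parsePostsFlag(['node', 'server']))             // 30
```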
Then modify the `getAllHTML` function to only return the HTML:
const getAllHTML = async (numberOfPosts) => {
return Promise.all(getPagesArray(numberOfPosts).map(getPageHTML))
.then(htmls => htmls.join(''))
}
One last thing to do in our `server.js` file is adding this Unix-style shebang line at the top:
#!/usr/bin/env node
It tells the system to execute this file with Node, which allows the file to be symlinked to a command we want to run from the command line. If we didn’t have this line and did everything specified below in this article, we would get a syntax error.
Now we could call `node server -p 10` to get 10 posts. But we want to use `hackernews` (or any other single word) instead of `node server`. So we go to the `package.json` file, where we can change a few things to make it happen.
1. We will supply a `bin` field in our `package.json`, which is a map of command name to local file name. `hackernews` is the command I chose to call instead of `node server`, and `./server.js` is my local script file to be run with this command. This format allows us to provide more than one script mapping if needed.
"dependencies": {
"cheerio": "^1.0.0-rc.2",
"commander": "^2.18.0",
"node-fetch": "^2.2.0"
},
"bin": {
"hackernews": "./server.js"
}
}
Then run:
npm link
And now run `hackernews -p 17` or `hackernews --posts 17`, or ask for any other number of posts you want, and see it working.
AMAZING.
You can always `npm unlink` and change the name of the command.
2. Now let’s configure a command that other people (who cloned your repo) can run to get all this working nicely. Add this line to the scripts section:
"scripts": {
"install-hackernews": "npm link && npm i -g",
"test": "echo \"Error: no test specified\" && exit 1"
},
They would need to run `npm run install-hackernews` to get all these other commands going (`npm link` plus an `npm i -g` global install, so that `hackernews -p 25` can be run from any other directory, such as Desktop). You can choose any name here:
"<chosen-name>": "npm link && npm i -g",
The `npm link` command allows us to locally ‘symlink a package folder’: it locally installs any command listed in the `bin` field of our `package.json`. In other words, `npm link` acts like a Node.js package installation simulator here.
Step 4 complete. We have a CLI here.
Quick experiment:
Let’s quickly check how this all works. Make a new file called `day.js` containing only this:
#!/usr/bin/env node

const days = ['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
const now = new Date()

console.log("Today is " + days[now.getDay()] + String.fromCodePoint(0x1f43c))
Then in package.json:
"bin": {
"hackernews": "./server.js",
"day": "./day.js"
}
Then run `npm link`, and then `day`. What do you see? What would you see if you cd back to Desktop and run `day`?
Thank you for coding-along! Here is the link to my repo: