Always Scrapy

I googled "how to build a web crawler" and the first results were almost always in Python, with many of them suggesting the Scrapy library from pip.

I am building both the frontend and the backend of this project with just Node and JavaScript, so I didn't want to switch languages, even if it looked like more people used other libraries or languages.

Side note: my dislike of Python

I prefer languages where I can clearly see where my functions or instructions start and end. I never liked using Python, even though I sometimes have to deal with it when working with machine learning paper implementations.

My Journey through JavaScript crawlers

After reading about and testing Scrapy, I started exploring the npm registry for packages that solve the same problem in Node.js:

Apify — Good but not for me.

I lost an hour trying to get a simple page parsed with the Apify SDK, trying to understand how to access the DOM and its selectors. If you want a feature-rich crawler it might work for you, but you need to learn its particular logic and I didn't have time for that.
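
For reference, this is roughly what I was aiming for: a minimal sketch based on the CheerioCrawler class of the Apify SDK (v1-style API), where the handlePageFunction receives a Cheerio $ that you can query like jQuery. Treat it as an untested sketch rather than a working recipe:

const Apify = require("apify")

Apify.main(async () => {
  // Seed the crawl with the start URLs
  const requestList = await Apify.openRequestList("start-urls", [
    { url: "https://gist.github.com/discover/" },
  ])

  const crawler = new Apify.CheerioCrawler({
    requestList,
    handlePageFunction: async ({ request, $ }) => {
      // "$" works like jQuery selectors on the fetched HTML
      console.log("Title of", request.url, "is", $("title").text())
    },
  })

  await crawler.run()
})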

Just HTML

I chose Supercrawler as the first program to fetch all the pages from the various websites, thanks to its speed. It is fast because it works only on the HTML documents, without spinning up an entire JavaScript environment the way Puppeteer-based crawlers do.

Here’s the entire code for a simple crawler — just pure HTML:

var supercrawler = require("supercrawler")

var crawler = new supercrawler.Crawler({
  // Time (ms) between requests
  interval: 1000,
  // Maximum number of requests at any one time.
  concurrentRequestsLimit: 5,
  // Time (ms) to cache the results of robots.txt queries.
  robotsCacheTime: 3600000,
  // User agent to send with every request during the crawl.
  userAgent:
    "Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler)",
})

crawler.addHandler("text/html", function (context) {
  var sizeKb = Buffer.byteLength(context.body) / 1024
  console.dir(context.body.toString())
  // Here you can do all the parsing of the HTML body of the page
  console.log("Processed", context.url, "Size=", sizeKb, "KB")
})

crawler
  .getUrlList()
  .insertIfNotExists(new supercrawler.Url("https://gist.github.com/discover/"))
  .then(function () {
    return crawler.start()
  })
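
The code above only fetches the URLs you insert yourself. To let Supercrawler actually discover new pages, you can also register its built-in htmlLinkParser handler, which extracts the links found in every HTML page and feeds them back into the URL list (the hostnames restriction below is just an example to keep the crawl on one site):

// Follow the links found in each HTML page, staying on the listed hostnames
crawler.addHandler(
  "text/html",
  supercrawler.handlers.htmlLinkParser({
    hostnames: ["gist.github.com"],
  })
)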

After parsing the body of the page and checking that all the required elements were present, I extracted what I needed. If the required elements were missing, I started a deeper pass with another crawler able to execute JavaScript, to render SPA pages that are built client side.
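
Here is a small sketch of that decision step; it is only an illustration, assuming cheerio is installed, with hasRequiredElements as a hypothetical helper name and placeholder selectors:

var cheerio = require("cheerio")

// Hypothetical helper: true when the static HTML already contains everything
// we need, false when the page looks client-rendered (SPA)
function hasRequiredElements(html, requiredSelectors) {
  var $ = cheerio.load(html)
  return requiredSelectors.every(function (selector) {
    return $(selector).length > 0
  })
}

// Inside the "text/html" handler above:
// if (!hasRequiredElements(context.body.toString(), ["#product", ".price"])) {
//   // hand the URL over to the headless crawler described below
// }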

Headless Chrome Crawler is the Winner!

HCC, or Headless Chrome Crawler, let me parse all those client-rendered pages without any problem, using jQuery-style selectors, and it took me less than five minutes to get started.

Here’s the entire code for a simple crawler:

const HCCrawler = require("headless-chrome-crawler")

console.log("Starting the fetch")

// Crawl the whole site by default; set singlePage to true to fetch only the start URL
const singlePage = false
let maxDepthCrawler = 6
if (singlePage) maxDepthCrawler = 1

const startPageUrl = "http://scrapoxy.io/"

;(async () => {
  const crawler = await HCCrawler.launch({
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    //headless: false,
    //slowMo: 10,
    evaluatePage: () => ({
      title: $("title").text(),
      // ADD here the other elements to parse, using jQuery selectors
    }),
    onSuccess: (result) => {
      const results = result.result
      console.log(`PRODUCT - ${results.title}.`)
      // ACCESS here the elements evaluated inside the page in the previous section;
      // the names have to be the same
    },
  })

  await crawler.queue({
    url: startPageUrl,
    maxDepth: maxDepthCrawler,
    depthPriority: false,
    // Keep the crawl inside the start domain
    allowedDomains: [/scrapoxy\.io$/],
  })
  await crawler.onIdle()
  await crawler.close()
})()

Find Selectors 10x faster

To be ten times faster at building the CSS selectors for your crawlers, I suggest installing a selector-picker browser extension.

How can I find the right selector?

If you are trying to parse a specific node, right-click it in the page, choose Inspect, then right-click the highlighted node in the Elements panel and pick Copy > Copy selector.
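
Once you have a selector copied, you can paste it straight into the evaluatePage function of the headless crawler. Here is a minimal sketch; the selector string and the productName field are just placeholders for whatever you copied:

const HCCrawler = require("headless-chrome-crawler")

;(async () => {
  const crawler = await HCCrawler.launch({
    evaluatePage: () => ({
      // Paste your copied selector here; this one is only a placeholder
      productName: $("#main > div.product > h1").text().trim(),
    }),
    onSuccess: (result) => console.log(result.result.productName),
  })
  await crawler.queue({ url: "http://scrapoxy.io/" })
  await crawler.onIdle()
  await crawler.close()
})()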

Conclusion

Building a crawler that feeds itself with web content should be based on a technology that is already part of every website: JavaScript and Node, not Python!

A future improvement to your crawler might be adding a pool of proxies, with a simple load balancer in front, to spread out all the requests made by your crawlers, for example using the free Scrapoxy library: http://scrapoxy.io/
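
As a rough sketch of that idea (assuming a Scrapoxy instance is already running locally on its default proxy port 8888, and that Headless Chrome Crawler forwards launch options such as args to Puppeteer):

const HCCrawler = require("headless-chrome-crawler")

;(async () => {
  const crawler = await HCCrawler.launch({
    // Route every request through the local Scrapoxy endpoint
    // (assumes Scrapoxy is running and listening on 127.0.0.1:8888, its default)
    args: ["--proxy-server=127.0.0.1:8888"],
    evaluatePage: () => ({ title: $("title").text() }),
    onSuccess: (result) => console.log(result.result.title),
  })
  await crawler.queue({ url: "http://scrapoxy.io/" })
  await crawler.onIdle()
  await crawler.close()
})()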

In this article I have just shared my opinion; I hope you found some tips or ways to speed up your work on building a web crawler. If you prefer to use Scrapy, let me know on Twitter!

References and Resources