
Do I need Python Scrapy to build a web scraper?

September 16, 2019

Always Scrapy

I googled “how to build a web crawler” and the first results were almost all in Python, with many of them suggesting the Scrapy library from pip.

I am building both the frontend and the backend of this specific project with just Node and JavaScript, so I didn’t want to switch languages, even if it looked like more people used other libraries or languages.

Side note: my dislike of Python


I prefer languages where I can clearly see where my functions and instructions start and end. I have never liked using Python, even though I sometimes have to deal with it when working with implementations of machine learning papers.

My Journey through JavaScript crawlers

After reading about and testing Scrapy, I started exploring npm packages that solve the same problem in Node.js:

Apify — Good but not for me.


I lost an hour trying to get a simple page parsed with the Apify SDK, trying to understand how to access the DOM and its selectors. If you want a full-featured crawler it might work for you, but you need to understand its particular logic, and I didn’t have time for that.
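For reference, here is a rough sketch of what I was trying to do, based on my reading of the Apify SDK documentation of that time. The URL and the title selector are just placeholders, so treat this as an assumption rather than the official way to use the SDK:

const Apify = require('apify');

Apify.main(async () => {
  // list of start URLs (placeholder URL, not from my real project)
  const requestList = new Apify.RequestList({
    sources: [{ url: 'https://example.com/' }],
  });
  await requestList.initialize();

  const crawler = new Apify.CheerioCrawler({
    requestList,
    // "$" here is a Cheerio object built from the static HTML, not a browser DOM
    handlePageFunction: async ({ request, $ }) => {
      console.log(request.url, $('title').text());
    },
  });

  await crawler.run();
});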

Just HTML

I chose Supercrawler as the first program to fetch all the pages from the various websites, thanks to its speed. It is fast because it works only on the HTML documents, without spinning up an entire JavaScript environment the way Puppeteer-based crawlers do.

Here’s the entire code for a simple crawler — just pure HTML: https://gist.github.com/Giorat/12be52223c9d6da5e7e872621bf009ca

var supercrawler = require("supercrawler");

var crawler = new supercrawler.Crawler({
  // Time (ms) between requests
  interval: 1000,
  // Maximum number of requests at any one time.
  concurrentRequestsLimit: 5,
  // Time (ms) to cache the results of robots.txt queries.
  robotsCacheTime: 3600000,
  // User agent to use during the crawl.
  userAgent: "Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler)",
});

crawler.addHandler("text/html", function (context) {
  var sizeKb = Buffer.byteLength(context.body) / 1024;
  console.dir(context.body.toString());
  // here you can do all the parsing of the HTML body of the page
  console.log("Processed", context.url, "Size=", sizeKb, "KB");
});

crawler.getUrlList()
  .insertIfNotExists(new supercrawler.Url("https://gist.github.com/discover/"))
  .then(function () {
    return crawler.start();
  });

After parsing the body of each page and making sure all the required elements were present, I extracted the data I needed. When a page was missing those elements, I started a deeper pass with a second crawler able to execute JavaScript, in order to render SPA pages that are built on the client side.
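As a rough illustration of this two-step approach (the .product-title selector and the needsJsRendering list are hypothetical, not taken from my real project), the Supercrawler handler can check the static HTML with cheerio and set aside the URLs that need the JavaScript-rendering crawler:

var cheerio = require("cheerio");
var supercrawler = require("supercrawler");

var crawler = new supercrawler.Crawler({ interval: 1000 });
var needsJsRendering = [];

crawler.addHandler("text/html", function (context) {
  var $ = cheerio.load(context.body.toString());
  // is the element we need already present in the static HTML?
  if ($(".product-title").length === 0) {
    // probably a client-rendered SPA: hand it off to the headless crawler later
    needsJsRendering.push(context.url);
  }
});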

Headless Chrome Crawler is the Winner!


HCC, or Headless Chrome Crawler, let me parse all those client-rendered pages without any problem, using jQuery-style selectors, in less than five minutes.

Here’s the entire code for a simple crawler: https://gist.github.com/Giorat/85d340b62a196eb70c1da500789cf402

const HCCrawler = require('headless-chrome-crawler');

console.log('Starting the fetch');

const singlePage = false;
let maxDepthCrawler = 6;
if (singlePage) maxDepthCrawler = 1;

const makeupPageUrl = 'http://scrapoxy.io/';

(async () => {
  const crawler = await HCCrawler.launch({
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    // headless: false,
    // slowMo: 10,
    evaluatePage: () => ({
      title: $('title').text(),
      // ADD here the other elements to parse, using jQuery selectors
    }),
    onSuccess: (result) => {
      const results = result.result;
      console.log(`PRODUCT - ${results.title}.`);
      // ACCESS here the elements evaluated inside the page in the previous section; the names have to match
    },
  });

  await crawler.queue({
    url: makeupPageUrl,
    maxDepth: maxDepthCrawler,
    depthPriority: false,
    // only follow links on the allowed domains
    allowedDomains: [/sephora\.it$/],
  });
  await crawler.onIdle();
  await crawler.close();
})();

Find Selectors 10x faster

To be ten times faster at building the JavaScript selectors for your crawlers, I suggest installing the following browser extension:

How can I find the right selector?

If you are trying to parse a specific node, you need to:

  • open your Chrome browser
  • open the Inspector Dev tools and select the Console tab
  • start playing with selectors and print the results in the console, as in the sketch below
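A quick console experiment might look like this (the product-name selector below is just a made-up example, replace it with the node you are after):

// plain DOM API: check how many nodes the selector matches and what they contain
document.querySelectorAll('h1.product-name').forEach(function (node) {
  console.log(node.textContent.trim());
});

// if the page (or your crawler) has jQuery available, the same check looks like this
console.log($('h1.product-name').first().text());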


Conclusion

A crawler that feeds itself with web content should be built on a technology that is already part of the web on every website, like JavaScript and Node, and not Python!

A future improvement to your crawler might be adding a pool of proxies behind a simple load balancer, to spread out all the requests your crawlers make, for example using the free Scrapoxy library: http://scrapoxy.io/
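As a hedged sketch of what that could look like: if I read the docs correctly, HCCrawler.launch passes Puppeteer launch options (including Chrome arguments) through, so something like this should route the crawler’s traffic through a local Scrapoxy instance. The proxy address and port are assumptions, adjust them to your own setup:

const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    // Chrome launch argument forwarded through Puppeteer:
    // every request goes through the local Scrapoxy load balancer
    // (the address and port are assumptions, check your Scrapoxy configuration)
    args: ['--proxy-server=http://127.0.0.1:8888'],
    evaluatePage: () => ({ title: $('title').text() }),
    onSuccess: (result) => console.log(result.result.title),
  });
  await crawler.queue('http://scrapoxy.io/');
  await crawler.onIdle();
  await crawler.close();
})();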

In this article I have just shared my opinion; I hope you found some tips or ways to speed up your work on building a web crawler. If you prefer to use Scrapy, let me know on Twitter!

Subscribe to my email list!

If you enjoy my work, you should definitely join my newsletter, “Giorat Mails”. It’s one email a week with everything interesting I’ve read or found, plus new articles and fresh tutorials.
