A couple of months back, I did a proof of concept to build a scraper entirely in JavaScript, using WebKit (Chrome) as the parser and front-end.
Having investigated some seemingly expensive SaaS scraping tools, I wanted to tease out where the challenges lie and open the door to some interesting projects. I have some background in data warehousing and a little exposure to natural language processing, but to do anything with either I needed a source of data.
The dataset I built covers 58,000 Flippa auctions, which have fairly well-structured pages with fielded data. I augmented the data with a crude form of entity extraction, to see which business models and partners are mentioned most often in website auctions.
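To make the "crude" part concrete, the extraction amounts to little more than counting dictionary hits across listing text. The sketch below illustrates that idea; the term list and function name are my own placeholders, not the actual code.

```javascript
// Crude entity extraction: count how many listings mention each term from a
// hand-made dictionary. TERMS and countEntities are illustrative names.
var TERMS = ['adsense', 'amazon', 'dropshipping', 'wordpress', 'clickbank'];

function countEntities(listings) {
  var counts = {};
  TERMS.forEach(function (t) { counts[t] = 0; });
  listings.forEach(function (text) {
    var lower = text.toLowerCase();
    TERMS.forEach(function (t) {
      if (lower.indexOf(t) !== -1) counts[t] += 1;
    });
  });
  return counts;
}
```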
Architecture
I did the downloading with wget, which worked great for this. One of my concerns with the SaaS solution I demoed is that if you make a mistake parsing one field, you might have to pay to re-download some subset of the data.
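For reference, the bulk download can be done with something along these lines; the URL list, delays, and output directory here are placeholders, not the exact invocation I used.

```
wget --input-file=auction-urls.txt --directory-prefix=pages/ \
     --wait=2 --random-wait --tries=3 --timeout=30
```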
One of my goals was to use a single programming language. In my solution, each downloaded file is opened in a Chrome tab, parsed, and then closed. I used Chrome because it is fast, but the solution should port easily to Firefox, since the activity within Chrome is essentially a Greasemonkey script. Opening the Chrome tabs is handled by Windows Script Host (WSH). The Chrome extension connects to a Node.js server to retrieve the actual parsing code and to save data back to a Postgres database. Having JavaScript on both client and server was fantastic for handling the back-and-forth communication. Despite sharing a single language, the three scripts (WSH, Node.js, and the Chrome extension) have very different APIs and programming models, so it is not as simple as I would like. Being accustomed to Apache, I was a little disappointed that I had to track down a separate tool just to keep Node.js running.
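The Node.js side is small. A rough sketch of the two endpoints the extension talks to might look like the following, assuming Express and node-postgres; the route names, table, and credentials are placeholders, and the real code in the repo differs.

```javascript
// server.js — minimal sketch; routes, table, and credentials are illustrative.
var express = require('express');
var fs = require('fs');
var pg = require('pg');

var app = express();
app.use(express.json());

var db = new pg.Client('postgres://scraper:secret@localhost/flippa');
db.connect();

// The extension fetches the current parsing code, so parsers can change
// without repackaging the extension.
app.get('/parser.js', function (req, res) {
  res.type('application/javascript').send(fs.readFileSync('parser.js', 'utf8'));
});

// The extension posts the fields it extracted from one auction page.
app.post('/auctions', function (req, res) {
  db.query('INSERT INTO auctions (url, data) VALUES ($1, $2)',
           [req.body.url, JSON.stringify(req.body.fields)],
           function (err) { res.sendStatus(err ? 500 : 201); });
});

app.listen(3000);
```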
Incidentally, WSH runs its JavaScript on the same JScript engine as Internet Explorer (IE); this worked well, unlike the typical web programming experience with IE. My first version of the orchestration script was a Cygwin bash script, which involved more resource utilization (i.e. threads) than Cygwin could handle. Once I switched to WSH I had no further problems of that sort, which is not surprising considering its long-standing use in corporate environments.
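The WSH side is essentially a loop over the downloaded files. A stripped-down illustration follows; the paths, the sleep interval, and the assumption that chrome.exe is on the PATH are all mine, not the production script.

```javascript
// launch.js — run with: cscript //nologo launch.js
var shell = new ActiveXObject("WScript.Shell");
var fso = new ActiveXObject("Scripting.FileSystemObject");

var files = new Enumerator(fso.GetFolder("C:\\scrape\\pages").Files);
for (; !files.atEnd(); files.moveNext()) {
  // Open each saved page in a new Chrome tab; the extension parses it and closes the tab.
  shell.Run('chrome.exe "file:///' + files.item().Path + '"', 1, false);
  WScript.Sleep(500); // crude pacing so tabs don't open faster than they are closed
}
```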
Challenges
By this point, the reader may have noticed that my host environment is Windows, chosen primarily to get the best value from Steam. The virtualization environment is built on VirtualBox using Vagrant and Chef, which make creating virtual machines fairly easy. Unfortunately, it is also easy to destroy them, so I kept the data on the host machine, backed up in git, to avoid wasting days of re-downloading. This turned out to be annoying because it meant dealing with two operating systems (Ubuntu and Windows), which have different networking configuration.
As the data volume increased, I found many new defects in this approach. Most were environmental issues, such as timeouts and the maximum number of TCP connections (presumably set low by default in Windows to slow the spread of bots).
Garbage collection also presented an issue. The Chrome processes consume memory at an essentially fixed rate, and that memory disappears as soon as each process ends, but garbage collection in Node.js produces a sawtooth memory pattern, and while Node.js is collecting, many Chrome tabs pile up. The orchestration script must watch for this in order to slow down and let Node.js catch up. The script should also pause if the CPU overheats; unfortunately I have not been able to read the CPU temperature. Although this capability is supposedly exposed by Windows APIs, it is supported by neither Intel's drivers nor my chip.
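One way to implement that back-pressure from the orchestration side is to poll how many Chrome processes are still alive and sleep while the count is high. The WMI query below is an illustration of that idea, with an arbitrary threshold; it is not necessarily the signal the real script watches.

```javascript
// Back-pressure sketch (WSH JScript): if too many Chrome processes are still open,
// Node.js hasn't finished saving their data yet, so wait before opening more tabs.
function chromeProcessCount() {
  var wmi = GetObject("winmgmts:\\\\.\\root\\cimv2");
  return wmi.ExecQuery("SELECT * FROM Win32_Process WHERE Name='chrome.exe'").Count;
}

while (chromeProcessCount() > 40) { // threshold is arbitrary
  WScript.Sleep(5000);
}
```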
Successes
A while back I read about Netflix's Chaos Monkey and tried to apply its principle of assuming failure to my own system. Ideally a parsing script should not stop in the middle of a multi-day run, so errors need to be handled gracefully. Each of the scripts has fail-retry logic, though unfortunately it differs in each one. Node.js restarts if it crashes because it runs under Forever. The orchestration script does not seem to crash, but it supports resumption at any point and watches the host machine to decide whether to slow down. The third script, the Chrome extension, watches for failed RPC calls and retries with exponential backoff.
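The backoff in the extension boils down to something like this sketch; the endpoint, retry cap, and timings are illustrative rather than the shipped values.

```javascript
// Post one parsed auction to the server, retrying with exponential backoff on failure.
function saveWithBackoff(payload, attempt) {
  attempt = attempt || 0;
  var xhr = new XMLHttpRequest();
  xhr.open('POST', 'http://localhost:3000/auctions');
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.onload = function () { if (xhr.status >= 500) retryLater(); };
  xhr.onerror = retryLater;
  xhr.send(JSON.stringify(payload));

  function retryLater() {
    if (attempt >= 8) return; // give up after roughly four minutes of retries
    // Wait 1s, 2s, 4s, ... so a restarting Node.js server isn't hammered.
    setTimeout(function () { saveWithBackoff(payload, attempt + 1); },
               Math.pow(2, attempt) * 1000);
  }
}
```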
Using the browser as a front-end gives you a free debugger and scripting console, as well as tools for generating XPath expressions.
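For example, a field can be pulled with an expression copied straight out of the developer tools; the XPath below is made up for illustration and is not Flippa's actual markup.

```javascript
// Evaluate an XPath expression against the loaded auction page and grab its text.
var node = document.evaluate(
  "//div[@class='auction-price']/span", document, null,
  XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
var price = node ? node.textContent.trim() : null;
```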
Possibilities
The current script runs through five to ten thousand entries before requiring attention. I intend to experiment with PhantomJS in order to improve performance, enable sharding, and support in-memory connections.
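A PhantomJS version would fold the browser and the orchestration into one headless process. The sketch below shows the general shape; the file argument and the extracted field are chosen purely for illustration.

```javascript
// phantom-parse.js — run with: phantomjs phantom-parse.js <saved-page.html>
var page = require('webpage').create();
var system = require('system');

page.open('file:///' + system.args[1], function (status) {
  if (status !== 'success') { phantom.exit(1); }
  var title = page.evaluate(function () {
    // same DOM parsing the Chrome extension does, but headless
    return document.title;
  });
  console.log(title);
  phantom.exit(0);
});
```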
See the source on GitHub here
Thanks to Ariele and Melissa for editing