One of the challenges in web crawling and scraping is determining which URLs to scrape. It's easy for a site to have many URLs that aren't visited by humans, like a stock photo site that uses an API to supplement its data. Sites with sessionid parameters or dynamic content may generate many duplicate or near-duplicate pages.
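To make the duplicate-page problem concrete, here is a minimal sketch of collapsing URLs that differ only by session parameters. The parameter names in SESSION_PARAMS and the helper names canonicalize and dedupe are my own assumptions for illustration; a real site may use different names, so treat this as a starting point rather than a complete answer.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed, illustrative list of query parameters that only carry session state.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}


def canonicalize(url):
    """Return a comparison key for url with session parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    kept.sort()  # parameter order should not create a "new" page
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )


def dedupe(urls):
    """Keep the first URL seen for each canonical form."""
    seen = set()
    unique = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    candidates = [
        "http://example.com/photos?id=7&sessionid=abc",
        "http://example.com/photos?sessionid=xyz&id=7",
        "http://example.com/photos?id=8",
    ]
    for url in dedupe(candidates):
        print(url)
```

This only catches duplicates that differ by query string. Pages that are merely similar in content would need a comparison of the page text itself.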
In a previous post I described a PhantomJS AdSense scraper, which demonstrates an instance where that tool is very helpful. One might scrape ads to find out who is running campaigns, what is selling, how products are pitched, and, if you are a publisher, who you might sell advertising to. There are products that do this, like MixRank.
There are a couple of ways you can do this on your own. One is Common Crawl, a not-for-profit that hosts a 70 TB index on AWS and lets you run Hadoop MapReduce queries against it. It has the entire text of many pages, which would allow searching the original source of each page. I started down this road, and it would work as a generalized solution if I were building a product, but I found an easier way.
There are a surprising number of search engine APIs, e.g. Yahoo, DuckDuckGo, Blekko, and Yandex. Blekko is very SEO-focused and exposes a lot of useful fields, such as whether a site is an AdSense publisher. Much of this understandably requires either an API key or a login, but you can easily add parameters to turn the output into JSON and increase the page size, like so:
http://blekko.com/ws/?q=guitar+tabs+/adsense=+/ps=100&json=1&
This gives you nicely formatted entries, like so:
{
  "c" : 1,
  "display_url" : "ultimate-guitar.com",
  "n_group" : 1,
  "rss" : "http://www.ultimate-guitar.com/modules/rss/all_updates.xml.php",
  "rss_title" : "Ultimate-Guitar.Com Updates",
  "short_host" : "ultimate-guitar.com",
  "short_host_url" : "http://www.ultimate-guitar.com/",
  "snippet" : "Search archives or submit tab. Your #1 source for guitar tabs, bass tabs, chords and guitar pro tabs. Guitar and bass tabs archive with daily updates. In order to use the widgets you need to. You can add up to three widgets to the home page's widget panel.",
  "toplevel" : "1",
  "url" : "http://www.ultimate-guitar.com/",
  "url_title" : "ULTIMATE GUITAR TABS ARCHIVE - 300,000+ Guitar Tabs, Bass Tabs, Chords and Guitar Pro Tabs"
},
{
  "c" : 2,
  "display_url" : "chordie.com",
  "main_slashtag_boosted" : "/blekko/tabs",
  "n_group" : 2,
  "rss" : "http://www.chordie.com/rss/mostpopular.rss",
  "rss_title" : "Most popular guitar songs",
  "short_host" : "chordie.com",
  "short_host_url" : "http://www.chordie.com/",
  "snippet" : "Guitar chords and guitar tablature made easy. Chordie is a search engine for finding guitar chords and guitar tabs. Search the Internet for guitar chords and tabs/tablatures. Guitar chords and guitar tabs. This morning a lot of people were getting a message about being banned for life.",
  "toplevel" : "1",
  "url" : "http://www.chordie.com/",
  "url_title" : "Guitar Tabs, Guitar Chords and Lyrics - Chordie"
},
{
  "c" : 3,
  "display_url" : "guitartabs.net",
  "n_group" : 3,
  "short_host" : "guitartabs.net",
  "short_host_url" : "http://www.guitartabs.net/",
  "snippet" : "ActiveBass.com Premier site with theory + bass tab search. GuitarWar.com Ultimate guitar tab competition. Tab Robot Unique guitar tabs engine. GuitarTricks Guitar tab,chords,and video lessons. Olga Search- search the OLGA tab archive by putting in the artist or song name in the search field at the top of the page.",
  "toplevel" : "1",
  "url" : "http://www.guitartabs.net/",
  "url_title" : "Guitar Tabs Dot Net - Your #1 source for guitar tabs"
},
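To turn entries like these into a list of sites to scrape, something along these lines should do. It is a rough sketch, not a client for the API: build_search_url and extract_sites are names I made up, the query terms simply mirror the example URL, and since the wrapper around the entries isn't shown here, extract_sites assumes you have already pulled the entries out into a list of dicts. It makes no network request.

```python
from urllib.parse import quote_plus


def build_search_url(query, page_size=100):
    """Build a search URL in the same shape as the example URL."""
    # The /adsense= and /ps=<n> terms are copied from the example query.
    terms = f"{query} /adsense= /ps={page_size}"
    # Leave "/" and "=" unescaped so those terms survive as written.
    return "http://blekko.com/ws/?q=" + quote_plus(terms, safe="/=") + "&json=1"


def extract_sites(entries):
    """Pick host, URL and RSS feed out of each entry, one row per host.

    entries is assumed to be a list of dicts with the fields shown in the
    sample output; "rss" is optional and comes back as None when absent.
    """
    sites = []
    seen = set()
    for entry in entries:
        host = entry.get("short_host")
        if not host or host in seen:
            continue
        seen.add(host)
        sites.append({"host": host, "url": entry.get("url"), "rss": entry.get("rss")})
    return sites


if __name__ == "__main__":
    print(build_search_url("guitar tabs"))
    # → http://blekko.com/ws/?q=guitar+tabs+/adsense=+/ps=100&json=1

    sample = [
        {"short_host": "ultimate-guitar.com", "url": "http://www.ultimate-guitar.com/",
         "rss": "http://www.ultimate-guitar.com/modules/rss/all_updates.xml.php"},
        {"short_host": "guitartabs.net", "url": "http://www.guitartabs.net/"},
    ]
    for site in extract_sites(sample):
        print(site["host"], site["url"], site["rss"])
```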
Querying a search engine this way saves hours over using Elastic MapReduce, much like purchasing a product would likely save me hours over doing it this way 😉