Scraping Videos with PhantomJS

I’ve been using PhantomJS for some scraping projects. PhantomJS is a headless WebKit browser, packaged to run JavaScript scripts. Some of my family are still on a slow connection with a low monthly bandwidth cap, so they can’t watch many videos. This is unfortunate given the number of training classes available online (pattern drafting, in this case).

Unfortunately, downloading video isn’t a supported use case for PhantomJS. It appears that the primary goal of PhantomJS is automated testing (like Selenium), and its developers don’t want to include the code needed to render videos, since that could mean dealing with many codecs.

Fortunately, there is an alternative project that works well: youtube-dl (GitHub link). This is a pre-packaged Python project that lets you download YouTube videos by channel, search result, playlist, etc. It also supports Google Video, Photobucket, Yahoo! Video, Dailymotion, blip.tv, DepositFiles, Vimeo, and more.

Setup is simple:

git clone https://github.com/rg3/youtube-dl

Python 2.x must be installed and, if you are on Windows, on your PATH.

You can run it easily, like so:

youtube-dl.exe -o %(stitle)s.%(ext)s http://youtube.com/user/DonMcCunn
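The value passed to -o is an output template written in Python's named %-formatting style. As a rough illustration of how such a template turns into a filename (this is not youtube-dl's own code; the helper name and the metadata values are made up, and I'm assuming stitle holds a filename-safe version of the title):

```python
# Illustration only: expand an output template with Python's %-formatting.
# The metadata below is hypothetical; youtube-dl supplies real values per video.
def expand_template(template, info):
    """Return the filename produced by applying `info` to a %(name)s template."""
    return template % info


sample_info = {"stitle": "Pattern_Drafting_Lesson_1", "ext": "flv"}
print(expand_template("%(stitle)s.%(ext)s", sample_info))
```

Each %(name)s field is replaced by the matching entry in the dictionary, so the literal dot between the two fields is what separates the title from the extension.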

Notably, the command-line arguments let you control the output filenames. Other arguments allow audio extraction, specifying the desired format, simulation (dry-run) options, and authentication. Typical file sizes are 5-10 MB per 5-minute video.
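If you end up scripting this, here is a minimal sketch of a wrapper that assembles the argument list. The function names and defaults are my own, and the flags (-o, --extract-audio, -f, --simulate) are as I understand them from youtube-dl's help output, so check `youtube-dl --help` for your version before relying on them.

```python
import subprocess


def build_command(url, template="%(stitle)s.%(ext)s", audio_only=False,
                  video_format=None, simulate=False, executable="youtube-dl"):
    """Assemble a youtube-dl argument list (hypothetical helper, not part of youtube-dl)."""
    cmd = [executable, "-o", template]
    if audio_only:
        cmd.append("--extract-audio")   # keep only the audio track
    if video_format:
        cmd += ["-f", video_format]     # request a specific format code
    if simulate:
        cmd.append("--simulate")        # dry run: do not download anything
    cmd.append(url)
    return cmd


def run(cmd):
    """Run the command and return its exit code; needs youtube-dl on PATH."""
    return subprocess.call(cmd)


# Show the command that would be executed, without running it.
print(" ".join(build_command("http://youtube.com/user/DonMcCunn", simulate=True)))
```

Passing the arguments as a list avoids shell quoting problems with the parentheses and percent signs in the template. Call run(...) on the result once the printed command looks right.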