Many proprietary search engines have to deal with complex licensing issues – WestLaw, for example, searches legal documents and has to contend with privacy concerns, the cost of scanning documents, and so on. For a research project, SSL certificates are appealing because you can do something interesting without dealing with these types of problems (see the final search engine here).
For acquiring data, I started with a list of a million domains and wrote a Scala script that checked each one for HTTPS and sent the results to Solr. For this project Solr works well as a primary data store: if you push data to it, it creates the field definitions for you, so it's almost entirely hands-off (although this does not work for updates). Another nice feature is that if you put the right suffixes on field names and use the default configuration, it will infer the types (e.g. _txt is a text field you want indexed, _s is a string you want stored unchanged, and _ss is an array of strings).
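As a rough sketch of that step – using SolrJ and the JDK's HttpsURLConnection; the core name, field names, and timeouts below are assumptions rather than what the real script used:

```scala
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument
import javax.net.ssl.HttpsURLConnection
import java.net.URL
import scala.util.Try

object HttpsCheck {
  // Hypothetical Solr core; adjust the URL to your own setup
  val solr = new HttpSolrClient.Builder("http://localhost:8983/solr/certificates").build()

  // Returns the certificate subjects if the domain answers over HTTPS
  def checkDomain(domain: String): Option[Seq[String]] = Try {
    val conn = new URL(s"https://$domain/").openConnection().asInstanceOf[HttpsURLConnection]
    conn.setConnectTimeout(5000)
    conn.setReadTimeout(5000)
    conn.connect()
    val subjects = conn.getServerCertificates.toSeq.collect {
      case c: java.security.cert.X509Certificate => c.getSubjectX500Principal.getName
    }
    conn.disconnect()
    subjects
  }.toOption

  def indexDomain(domain: String): Unit =
    checkDomain(domain).foreach { subjects =>
      val doc = new SolrInputDocument()
      doc.addField("id", domain)
      doc.addField("domain_s", domain)                 // _s: plain string field
      subjects.foreach(doc.addField("subjects_ss", _)) // _ss: multi-valued strings
      solr.add(doc)                                    // commit separately, and rarely
    }
}
```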
I tested all the apparently popular methods for parallelism in Scala, since I started this without a clear idea of how any of them worked. I built the first iteration with Akka, which provides non-blocking hand-offs between steps, but I realized partway through that this isn't very useful here: the process has only two I/O-bound steps (retrieval over the network and pushing to Solr), so Akka is overkill.
If you build something like this with Solr, it's important not to commit too frequently, or the whole thing slows to a crawl.
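For example, instead of calling commit() after every document, you can lean on commitWithin and let Solr batch the commits itself (a minimal sketch; the one-minute window is just an assumption to tune):

```scala
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument

val solr = new HttpSolrClient.Builder("http://localhost:8983/solr/certificates").build()

def index(doc: SolrInputDocument): Unit =
  solr.add(doc, 60000) // commitWithin = 60s; Solr decides when to actually commit
```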
Futures (and the .par operation) let you create a bunch of lambdas that are then run by a thread pool, which is much more compelling. The problem with long-running parallel scripts is that they seem to eventually die or exhaust some resource (file handles, rate limits on DNS lookups). This typically only happens after running for hours, which is a pain to troubleshoot and not really worth the effort for this type of project.
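A minimal sketch of the Futures approach – the pool size is an arbitrary assumption, and indexDomain stands in for whatever per-domain work you're doing:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Back the Futures with a fixed pool instead of the default global one,
// so you control how many domains are in flight at once.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(32))

def indexDomain(domain: String): Unit = ??? // the per-domain check-and-push step

def processAll(domains: Seq[String]): Unit = {
  val work = Future.traverse(domains)(d => Future(indexDomain(d)))
  Await.result(work, Duration.Inf) // block the driver until everything finishes
}
```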
My preferred solution is to insert the list of desired tasks (in this case the million domains) into RabbitMQ, which hands tasks out to threads and marks them as complete when the thread acknowledges finishing the task. While this sounds like over-engineering, it makes it super easy to kill a script and resume, since any tasks that were handed out but not completed just get re-run. I had previously written a similar script in C# and just opened multiple copies of it for threading, but I found that combining RabbitMQ with Scala's Futures gives you a lot more control.
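Here's roughly what the worker side looks like with the Java RabbitMQ client – the queue name, host, and prefetch count are assumptions:

```scala
import com.rabbitmq.client.{CancelCallback, ConnectionFactory, DeliverCallback}
import java.nio.charset.StandardCharsets.UTF_8

val factory = new ConnectionFactory()
factory.setHost("localhost")
val channel = factory.newConnection().createChannel()

channel.basicQos(16) // at most 16 unacknowledged tasks per worker

val onDeliver: DeliverCallback = (_, delivery) => {
  val domain = new String(delivery.getBody, UTF_8)
  HttpsCheck.indexDomain(domain)                                // do the work
  channel.basicAck(delivery.getEnvelope.getDeliveryTag, false)  // mark it complete
}
val onCancel: CancelCallback = _ => ()

// autoAck = false: if the script dies, unacked tasks are simply handed out again
channel.basicConsume("domains", false, onDeliver, onCancel)
```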
Another nice thing about RabbitMQ is that it gives you the option of using multiple language runtimes in an ETL script. I find this compelling because there are great NLP tools in both Python and Java, and this way you know you won't get stuck. If you wanted to create PDF renditions of Word documents as part of this type of script, you'd probably end up scripting Word in .NET. I've also noticed that a lot of companies that provide SDKs for their APIs have preferred languages (e.g. there isn't currently a good JavaScript SDK for DropBox, but the C# one is fine).
The biggest issue I haven't figured out how to solve with this kind of queue-based setup is rate limiting. I think the way to do this is to have a queue of tasks and a second queue of "dead" tasks; you then either set a timeout to kill the task, or pull from the dead queue once a day to revive those tasks.
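One hedged sketch of that idea uses RabbitMQ's dead-lettering: rate-limited tasks get rejected into a holding queue whose message TTL routes them back onto the main queue later. The queue names and the 24-hour TTL here are assumptions, not something I've settled on:

```scala
import com.rabbitmq.client.ConnectionFactory

val factory = new ConnectionFactory()
factory.setHost("localhost")
val channel = factory.newConnection().createChannel()

// Main task queue: anything a worker rejects (basicNack with requeue = false)
// is dead-lettered onto the "dead" queue instead of being retried immediately.
val mainArgs = new java.util.HashMap[String, Object]()
mainArgs.put("x-dead-letter-exchange", "")
mainArgs.put("x-dead-letter-routing-key", "domains.dead")
channel.queueDeclare("domains", true, false, false, mainArgs)

// Dead queue: messages sit here for 24 hours, then expire and are routed
// straight back onto the main queue, which revives the task.
val deadArgs = new java.util.HashMap[String, Object]()
deadArgs.put("x-message-ttl", Int.box(24 * 60 * 60 * 1000))
deadArgs.put("x-dead-letter-exchange", "")
deadArgs.put("x-dead-letter-routing-key", "domains")
channel.queueDeclare("domains.dead", true, false, false, deadArgs)
```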
A potentially superior architecture for this problem would be a message queue on AWS plus AWS Lambda (or the Azure equivalent), as this lets you scale the parallelism much higher. Right now I'm doing this infrequently, so it doesn't hurt to let a script run for a week (and it takes longer than that to write a UI to handle the data).
While there is not a lot of novelty in what I did for data acquisition, Scala did end up being a nice choice for a couple of reasons. JSON parsing in Scala is interesting because it lets you enforce a type and pattern match on the contents of a JSON payload. One of the things I don't like about C# is that you're sometimes forced to make one-off classes far from where they are used; Scala lets you put these use-once types right in the spot where they are used, which makes the code easier to follow.
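A sketch of what I mean, using json4s here (an assumption – any of the Scala JSON libraries work similarly; the payload shape and CertRecord type are made up for illustration):

```scala
import org.json4s._
import org.json4s.native.JsonMethods._

object CertParsing {
  // The use-once type sits right next to the only code that needs it
  case class CertRecord(domain: String, issuer: String)
  implicit val formats: Formats = DefaultFormats

  def parseCertRecord(payload: String): Option[CertRecord] =
    // pattern match on the parsed JSON AST, then enforce the type with extract
    parse(payload) \ "cert" match {
      case obj: JObject => Some(obj.extract[CertRecord])
      case _            => None
    }
}
```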
I think that if I keep doing this sort of thing, I'll end up building out a library of Scala utility functions for different APIs/scenarios (e.g. pull a JSON file, find the existing Solr records, update them), which would be useful for a lot of projects. This is also true of specific API usage – for example, you could imagine an IFTTT-style JVM library that offered access to a range of services.