Guide To Solr
Below are some of the most popular articles I've written about Solr.
Table of Contents
Building a full-text index of git commits using lunr.js and Github APIs
Converting git commit history to a solr full-text index
Expert Search Statistics
Finding Corporate Sponsors of Open Source
Fixing org.apache.solr.common.SolrException: Length Required
Full-Text Indexing PDFs in Javascript
Improving the default Android Keyboard
Solr CSV DataImportHandler sample
Solr DataImportHandler example with FileDataSource
Building a full-text index of git commits using lunr.js and Github APIs
GitHub has a nice API for inspecting repositories – it lets you read gists, issues, commit history, files and so on. Git repository data lends itself to demonstrating the power of combining full-text and faceted search, as there is a mix of free-text fields (commit messages, code) and enumerable fields (committers, dates, committer [...]
Converting git commit history to a solr full-text index
I built a 4-million-document archive from GitHub commits, which lets you search for open source experts, ranked by commit count. Solr is a relatively recent addition to the world of Lucene (2007); it adds a web-app UI over Lucene, scaling (highly available reads), and configuration. For those [...]
Expert Search Statistics
The following are some interesting statistics about the GitHub expert-finder:
Unique repositories: 18,977
Source git repos: 250+ GB
Solr index size: 3.2 GB
Time to build index: ~12 hours spread over several days (had to restart the indexer several times)
Number of commits: 4,579,236
Finding Corporate Sponsors of Open Source
I copied about 19,000 git repositories into a full-text Solr index. Because commits are tied to email addresses, this provides interesting insight into corporate open source contributions. The search front-end I added lets you search for programmers or companies, grouped by the number of commits. For example, searching for Linux returns the following results: linux-foundation [...]
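The grouping idea can be sketched in a few lines of Python. This is a minimal illustration, not the Solr query behind the search front-end: the sample commits, the example.com-style addresses, and the helper name `commits_by_domain` are all made up for the example, and the sketch simply treats the domain of each committer email as a stand-in for the company.

```python
from collections import Counter


def commits_by_domain(commits):
    """Count commits per email domain (hypothetical helper, not from the post)."""
    counts = Counter()
    for commit in commits:
        email = commit.get("email", "")
        # Skip entries without a usable address.
        if "@" not in email:
            continue
        domain = email.rsplit("@", 1)[1].lower()
        counts[domain] += 1
    # most_common() returns (domain, count) pairs sorted by descending count.
    return counts.most_common()


# Invented sample data, for illustration only.
sample_commits = [
    {"email": "alice@example.com", "message": "Fix build"},
    {"email": "bob@example.com", "message": "Add tests"},
    {"email": "carol@example.org", "message": "Update docs"},
    {"email": "no-address", "message": "Import"},
]

if __name__ == "__main__":
    for domain, count in commits_by_domain(sample_commits):
        print(domain, count)
```

In Solr itself, a count like this would more likely come from faceting on a field that holds the domain; the field layout of the original index is not shown in this excerpt, so that part is left as an assumption.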
Fixing org.apache.solr.common.SolrException: Length Required
I received the following exception after making no code changes:
org.apache.solr.common.SolrException: Length Required
The issue is that CommonsHttpSolrServer does not send a Content-Length header in updates. The root cause in my case was switching the front-end proxy from Apache to Nginx, which is apparently stricter about headers.
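As a rough illustration of what the proxy expects, the sketch below builds a Solr update request with an explicit Content-Length header using Python's standard library. The URL and the XML payload are placeholder assumptions, and this is not how CommonsHttpSolrServer is configured; it only shows the header whose absence leads to an HTTP 411 Length Required response. The request is built but never sent.

```python
from urllib.request import Request

# Placeholder endpoint (an assumption); adjust to the actual Solr update handler.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_body: str) -> Request:
    """Build a POST request whose body length is declared up front."""
    data = xml_body.encode("utf-8")
    headers = {
        "Content-Type": "text/xml; charset=utf-8",
        # The length is counted in bytes, not characters.
        "Content-Length": str(len(data)),
    }
    return Request(SOLR_UPDATE_URL, data=data, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_update_request('<add><doc><field name="name">example</field></doc></add>')
    print(req.get_method(), req.full_url)
    print(req.header_items())
```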
Full-Text Indexing PDFs in Javascript
I once worked for a company that sold access to legal and financial databases (as they call it, “intelligent information”). Most court records are PDFs available through PACER, a website developed specifically to distribute court records. Meaningful database products on this dataset require building a processing pipeline that can extract and index text from the [...]
Improving the default Android Keyboard
My Android keyboard makes word suggestions as you type. The algorithm appears to be a frequency-based text look-up, although it occasionally picks up similar-sounding words. While usable, it has enough issues to be worth replacing. Android kindly lets you do this, and there are numerous apps to do so. To build a new keyboard, we [...]
Solr CSV DataImportHandler sample
The following will import a two-field CSV file into Solr, assuming two columns, name and count. The name field is always quoted, and the regex expects a tab between the two columns.
<dataConfig>
  <dataSource name="ds1" type="FileDataSource" />
  <document>
    <entity name="ngrams" processor="LineEntityProcessor" url="E:/Projects/Data/words-txt.csv" dataSource="ds1" transformer="RegexTransformer">
      <field column="rawLine" regex="^&quot;(.*)&quot;\t(.*)$" groupNames="name,count" />
    </entity>
  </document>
</dataConfig>
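To see what the regex in this config captures, here is a small Python sketch that applies the same pattern to a single line. It only mimics the two capture groups named by groupNames; the sample line is invented, the helper name `parse_line` is hypothetical, and Python's `re` module stands in for the Java regex engine that Solr uses.

```python
import re

# Same pattern as in the config: a quoted name, a tab, then the count.
LINE_PATTERN = re.compile(r'^"(.*)"\t(.*)$')


def parse_line(raw_line):
    """Return a dict with 'name' and 'count', or None if the line does not match."""
    match = LINE_PATTERN.match(raw_line)
    if match is None:
        return None
    name, count = match.groups()
    # The count is left as a string, as a regex capture would be.
    return {"name": name, "count": count}


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    print(parse_line('"hello world"\t42'))
```

Lines that are not quoted or not tab-separated return None here, which is a quick way to check a data file against the pattern before running an import.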
Solr DataImportHandler example with FileDataSource
This imports each line of a text file as a single document, probably about the simplest thing you can do. The schema has a single attribute, “name”, which is defined as a unique attribute.
<dataConfig>
  <dataSource name="ds1" type="FileDataSource" />
  <document>
    <entity name="entity" processor="LineEntityProcessor" url="E:/Projects/Data/wlist_all/wlist_match10.txt" dataSource="ds1">
      <field column="rawLine" name="name" />
    </entity>
  </document>
</dataConfig>
