Building a full-text index of git commits using lunr.js and GitHub APIs

GitHub has a nice API for inspecting repositories: it lets you read gists, issues, commit history, files and so on. Git repository data lends itself to demonstrating the power of combining full-text and faceted search, as there is a mix of free-text fields (commit messages, code) and enumerable fields (committers, dates, committer employers). GitHub APIs return JSON, which has the nice property of being a tree structure, so results can be recursed over without fear of infinite loops. Note that to download the entire commit history for a repository, you need to page through it by SHA hash. The API I use here lacks diffs, which must be retrieved elsewhere.

To test this, request a URL like the one below. The configurable parts are the repository owner and name.
https://api.github.com/repos/torvalds/linux/commits
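
Paging through the full history by SHA can be sketched roughly as follows. This is a minimal sketch rather than a tested client: commitsUrl and appendPage are hypothetical helper names, and it assumes the list-commits endpoint's sha and per_page query parameters, where sha names the commit to start listing from. Under that assumption each page begins with the commit that ended the previous one, so the overlap is dropped.

```javascript
// Hypothetical helpers for paging through commit history by SHA.
// Assumes the list-commits endpoint accepts `sha` (the commit to start
// listing from) and `per_page` query parameters.
function commitsUrl(owner, repo, sha, perPage) {
  var url = 'https://api.github.com/repos/' +
    encodeURIComponent(owner) + '/' + encodeURIComponent(repo) +
    '/commits?per_page=' + (perPage || 100);
  if (sha) {
    url += '&sha=' + encodeURIComponent(sha);
  }
  return url;
}

// Appends one page of commits to `all`, skipping a leading commit that
// repeats the last one already collected. Returns the SHA to request
// next, or null when the page added nothing new (history exhausted).
function appendPage(all, page) {
  var start = 0;
  if (all.length > 0 && page.length > 0 &&
      page[0].sha === all[all.length - 1].sha) {
    start = 1;
  }
  for (var i = start; i < page.length; i++) {
    all.push(page[i]);
  }
  return page.length > start ? all[all.length - 1].sha : null;
}
```

A caller would fetch commitsUrl(owner, repo, next), pass the parsed array to appendPage, and repeat until appendPage returns null.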

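This is what a commit looks like. In the list returned by the URL above, the author, committer, message and tree fields sit under a "commit" key, which is why the flattened field names used later start with "commit_":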

{
  "sha": "7638417db6d59f3c431d3e1f261cc637155684cd",
  "url": "https://api.github.com/repos/octocat/Hello-World/git/commits/7638417db6d59f3c431d3e1f261cc637155684cd",
  "author": {
    "date": "2008-07-09T16:13:30+12:00",
    "name": "Scott Chacon",
    "email": "schacon@gmail.com"
  },
  "committer": {
    "date": "2008-07-09T16:13:30+12:00",
    "name": "Scott Chacon",
    "email": "schacon@gmail.com"
  },
  "message": "my commit message",
  "tree": {
    "url": "https://api.github.com/repos/octocat/Hello-World/git/trees/827efc6d56897b048c772eb4087f854f46256132",
    "sha": "827efc6d56897b048c772eb4087f854f46256132"
  },
  "parents": [
    {
      "url": "https://api.github.com/repos/octocat/Hello-World/git/commits/7d1b31e74ee336d15cbd21741bc88a537ed063a0",
      "sha": "7d1b31e74ee336d15cbd21741bc88a537ed063a0"
    }
  ]
}

To make the test simple, I download these as JSON locally, then start a Python web server. Were I to make many such calls on a public site, I’d set up a proxy to the GitHub APIs.

python -m SimpleHTTPServer   # Python 2
python3 -m http.server       # Python 3 equivalent

This data has a number of nested objects and must be flattened to fit into the lunr.js full-text index. This example uses the commit number (0, 1, 2..N) as the location in the index, but a real environment should use the commit hash to allow partitioning the ingestion process. Nested objects are flattened by joining subsequent keys with underscores in between. A production-worthy solution needs to escape these to prevent collisions.

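var documents = [];

// Walk a value; nested objects and arrays contribute their keys
// (or indices) to the flattened field name, joined by "_".
function recurse(doc_num, base, obj, value) {
  if ($.isPlainObject(value) || Array.isArray(value)) {
    $.each(value, function (k, v) {
      recurse(doc_num, base + obj + "_", k, v);
    });
  } else {
    process(doc_num, base + obj, value);
  }
}

// Store a leaf value as a string under its flattened key.
function process(doc_num, key, value) {
  if (documents.length <= doc_num)
    documents[doc_num] = {};

  if (value !== null)
    documents[doc_num][key] = value + '';
}

// data is the array of commits parsed from the downloaded JSON.
$.each(data, function(doc_num, commit) {
  $.each(commit, function(k, v) {
    recurse(doc_num, '', k, v);
  });
});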
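
The collision problem is easy to see: a top-level key "a_b" and a key "b" nested under "a" both flatten to "a_b". One way to escape keys is sketched below, without jQuery. The names escapeKey and flatten are hypothetical, and percent-style escaping is just one possible scheme; whether a given lunr version accepts such field names is not checked here.

```javascript
// Escape '%' first, then '_', so an escaped key never contains a bare
// underscore and the '_' separator stays unambiguous.
function escapeKey(key) {
  return String(key).replace(/%/g, '%25').replace(/_/g, '%5F');
}

// Dependency-free flattening sketch: nested objects (and arrays, by
// index) are walked recursively; every other non-null value is stored
// as a string under the joined, escaped key.
function flatten(value, prefix, out) {
  out = out || {};
  if (value !== null && typeof value === 'object') {
    Object.keys(value).forEach(function (k) {
      var key = (prefix === undefined ? '' : prefix + '_') + escapeKey(k);
      flatten(value[k], key, out);
    });
  } else if (value !== null && value !== undefined && prefix !== undefined) {
    out[prefix] = value + '';
  }
  return out;
}
```

With this scheme the two colliding inputs above end up under distinct keys, "a%5Fb" and "a_b".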

Normally, one sets up a lunr full-text index by specifying all the fields, much like Solr's numerous XML config files. Lunr has far fewer configuration options: the only per-field setting is the ‘boost’ parameter, which increases the weight of certain fields in ranking. I imagine this will change as the project grows, at the very least to include type hints.

Given the simplicity of field objects, you can infer the field list from JSON payloads. The code below provides two modes: one inspects the entire JSON payload, and the other limits how many commits are checked, a good option when the JSON data is consistent.

The function accepts configuration objects resembling ExtJS config objects, which lets you override as desired. If fields derived from existing data are required, they can be inserted after any documents are inserted.

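function inferIndex(documents, config) {
  config = config || {};
  return lunr(function() {
    this.ref('id');
    var found = {};  // field names already registered
    var idx = this;

    $.each(documents, function(doc_num, doc) {
      // Stop once `limit` documents have been inspected;
      // returning false breaks out of $.each.
      if (config.limit && doc_num >= config.limit)
        return false;

      $.each(doc, function(k, v) {
        if (!found[k]) {
          // Pass per-field options such as {boost: 10} when supplied.
          if (config[k]) {
            idx.field(k, config[k]);
          } else {
            idx.field(k);
          }
          found[k] = true;
        }
      });
    });
  });
}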

var index = 
  inferIndex(documents, 
    {limit: 1, 
     'commit_author_name':{boost:10}});

Inserting flattened documents into the index becomes simple. The code below accepts a callback, should you desire to add calculated fields.

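// doc_cb is optional; when given, it receives each document and
// returns the (possibly extended) document to index.
function addDocuments(index, documents, doc_cb) {
  $.each(documents, function(doc_num, attrs) {
    var doc = $.extend({id: doc_num}, attrs);

    if (doc_cb) {
      doc = doc_cb(doc);
    }

    // index.add is the lunr 0.x API; in lunr 2.x documents are
    // added inside the lunr(...) builder function instead.
    index.add(doc);
  });
}

addDocuments(index, documents);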
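
As an illustration of a calculated field, the callback below derives a rough stand-in for the author's company from the email domain. This is only a sketch: addAuthorDomain and the commit_author_domain field are hypothetical names, and it assumes the flattened documents carry a commit_author_email field.

```javascript
// Hypothetical callback: derives a crude "company" value from the
// author's email domain. The field names assume the flattened
// commit data (commit_author_email); adjust them to your payload.
function addAuthorDomain(doc) {
  var email = doc.commit_author_email || '';
  var at = email.lastIndexOf('@');
  // Only set the field when something follows the '@'.
  if (at !== -1 && at < email.length - 1) {
    doc.commit_author_domain = email.slice(at + 1).toLowerCase();
  }
  return doc;
}
```

A public mail domain such as gmail.com says nothing about an employer, so real data would need a lookup table or similar. The derived field also has to be registered with the index before lunr will search it.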

At this point we’ve indexed the entire commit history from a git repository, which lets us search for commits by topic. While this is useful, it’d be really nice to be able to facet on fields, which would return the number of documents in a category, like a SQL group by. I've found it particularly convenient to facet on author, date, or author's company.

If you have access to the original documents, you can easily construct facets based on the results of a lunr search:

function facet(index, query, data, field) {
  var results = index.search(query);

  var facets = {};
  $.each(results, function(i, searchResult) {
    // ref is the document id, i.e. its position in data
    var doc = data[searchResult.ref];
    var value = doc[field];

    facets[value] = (facets[value] === undefined ? 0 : facets[value]) + 1;
  });

  return facets;
}
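
Faceting on a date needs bucketing first, since nearly every commit has its own timestamp. The sketch below generalizes the counting step to take a key function and buckets by month. The names facetBy and authorMonth are hypothetical, and it assumes documents with an ISO 8601 commit_author_date field and results shaped like lunr's, as objects with a ref property.

```javascript
// Counts search results per bucket. `results` is an array of objects
// with a `ref` property, `data` maps refs to the original documents,
// and keyFn maps a document to its bucket name.
function facetBy(results, data, keyFn) {
  var facets = Object.create(null);  // no inherited keys to collide with
  results.forEach(function (result) {
    var key = keyFn(data[result.ref]);
    facets[key] = (facets[key] || 0) + 1;
  });
  return facets;
}

// Buckets an ISO 8601 timestamp by its "YYYY-MM" prefix. This uses the
// date as written, in the commit's own UTC offset, not converted to UTC.
function authorMonth(doc) {
  var date = doc.commit_author_date;
  return date ? date.slice(0, 7) : 'unknown';
}
```

A call such as facetBy(index.search('driver'), documents, authorMonth) would then return commit counts per month.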

Commit messages in repositories where I work often contain names of clients who requested a feature or bug fix. Consequently, a search faceted by author lists who worked with each client the most. The same technique can also tell you who has worked with various pieces of technology.

The following query demonstrates this technique:

var facets = 
   facet(index, 
        'driver', 
        documents, 
        'commit_author_name');
{"Wolfram Sang":24,"Linus Torvalds":3}

The approach shown here works well, but computing the facets requires access to the original document data. If we want to filter the results to a category, we need a richer search API than lunr currently provides, as well as callback options within the search API. Solr also has options to skip lower-casing data, as lower-casing may be inappropriate for category titles. Mitigating these issues will be explored further in future essays.
