To build a new keyboard, we need a wordlist, preferably one that ranks words by usage, or other context clues. Google distributes a dataset called Google n-grams, which provides word and phrase lists from scanned books in several languages, broken out by year. A list of all phrases from one to five words is 300 GB zipped, although the single word phrases are only 2GB. The data includes many non-word tokens, such as “$100,” “$101,” $102,” etc, as well as typographical errors. I filtered this to two letter words from a single year that occur at least ten times, which leaves a manageable 550k “words,” about 110 MB – a more appropriately sized data set for a phone.
I imported the data into Solr 4, a tool for rolling your own search engine. In this case we search for a single word (what the user typed), and return the top five best matches. Out of the box, Solr includes many useful search features, such as word equivalence, stemming, and extensive customization hooks (see previous examples). Solr has a specialized query language to allow controlling the search criteria and ranking.
As an example, a simplistic approach to word suggestions in the phone keyboard would be a “starts with” search, yielding terrible results for the following two searches:
name:b* baba,babbage,babble,babcock,babe
name:bee* beef,beehive,beep,beeper,beeps
Solr comes with a stemming algorithm to identify words with similar endings, e.g. “run” and “running.” The word list includes essentially every possible grammatical variation of these as it is, so doing a “starts with” search means we don’t need to turn on stemming.
Improving the previous example to sort by frequency yields nearly the same result as my phone:
{!boost b=count}(name:b*) been,before,between,back
{!boost b=count}(name:bee*) been,beer,beef,bee,beethoven
The third term in the result, “bee,” is exactly what the user typed. This should be removed to make better use of limited screen real-estate on the phone.
{!boost b=count}(name:bee*) -name:bee been,beer,beef,beethoven,beech
The problem with this technique is that it assumes that the user typed exactly what they wanted. We can improve this result by storing incorrectly spelled versions of words, and searching against those – if the user mis-types, the search engine can return the original, correct spelling.
To do this, Solr lets you store the same data in multiple formats. For instance, say we add a field called “duplicate,” which removes all duplicate letters. “Beer” becomes “Ber,” “Bookkeeper” becomes “Bokeper,” and so on. The same transformation is applied to the user’s input before searching.
{!boost b=count}(duplicate:bok* OR name:bookk*) book,books,book's,booked,bookseller
This returns good results, even though the user hit “k” twice.
For a second improvement, I’ve added a field which treats each sequence of three characters on the keyboard as equivalent: “q”, “w”, “e” or “r”, “t”, “y”, etc. Each letter of the word is replaced by the first of the sequence – this yields “vqq” for the word “bee.”
{!boost b=count}(typos:vqq*) -name:bee been,need,news,newspaper
These are all words that might have been intended, if your aim was a bit off. In the following contrived example, the user hit “b” instead of “n”:
{!boost b=count}( (name:bew)^2 or (typos:vqq* and -name:bew*) or (dup:bew* and -name:bew*)) -name:bew been,new,need,news,newspaper
{!boost b=sum(appcount,log(count))}name:bee*
Using log(count) will still return records ordered by frequency, but allows the frequency of word use with in an app to override the real-world usage, since log(count) is less than seven or eight for all words in this dataset.
I hit the ‘v’ and the ‘L’ a lot for whatever reason, so adding the field which treats the letters as equivalent to others makes a lot of sense to me. I bet there are a whole bunch more ways you can tighten up the guessing game the phone plays, too.