Gary Sieling

Improving the default Android Keyboard

My Android keyboard makes word suggestions as you type. The algorithm appears to be a frequency-based text look-up, although it occasionally picks up similar-sounding words. While usable, it has enough issues to be worth replacing. Android kindly lets you do this, and there are numerous apps to do so.

To build a new keyboard, we need a word list, preferably one that ranks words by usage or other context clues. Google distributes a dataset called Google n-grams, which provides word and phrase lists from scanned books in several languages, broken out by year. A list of all phrases from one to five words is 300 GB zipped, although the single-word phrases are only 2 GB. The data includes many non-word tokens, such as “$100,” “$101,” “$102,” etc., as well as typographical errors. I filtered this down to words of at least two letters, from a single year, that occur at least ten times, which leaves a manageable 550k “words,” about 110 MB, a more appropriately sized data set for a phone.
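
As a rough sketch, the filtering can be done with a short script. This version assumes the tab-separated 1-gram layout of word, year, match count, and volume count; the file names and the year are placeholders.

from collections import defaultdict

# Reduce a Google 1-gram file to a phone-sized word list: keep alphabetic
# words of at least two letters, from a single year, that occur at least
# ten times. Assumes tab-separated lines of word, year, match_count,
# volume_count; the file names and the year below are placeholders.
TARGET_YEAR = "2008"
MIN_COUNT = 10

def filter_ngrams(in_path, out_path):
    counts = defaultdict(int)
    with open(in_path, encoding="utf-8") as src:
        for line in src:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue
            word, year, count = parts[0], parts[1], int(parts[2])
            if year == TARGET_YEAR and word.isalpha() and len(word) >= 2:
                counts[word.lower()] += count  # merge case variants
    with open(out_path, "w", encoding="utf-8") as dst:
        for word, count in counts.items():
            if count >= MIN_COUNT:
                dst.write(f"{word}\t{count}\n")

filter_ngrams("googlebooks-eng-1gram.txt", "wordlist.tsv")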

I imported the data into Solr 4, a tool for rolling your own search engine. In this case we search for a single word (what the user typed) and return the top five best matches. Out of the box, Solr includes many useful search features, such as word equivalence, stemming, and extensive customization hooks (see previous examples). Solr also has a specialized query language for controlling the search criteria and ranking.
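
A minimal sketch of issuing such a query from Python over Solr's HTTP interface; the core name “words” and the field names are my own assumptions, but the /select handler and its q, rows, fl, and wt parameters are standard Solr.

import requests

SOLR_URL = "http://localhost:8983/solr/words/select"  # "words" core is hypothetical

def suggest(query, rows=5):
    # Ask Solr for the top `rows` matches and return just the word field.
    params = {"q": query, "rows": rows, "fl": "name", "wt": "json"}
    response = requests.get(SOLR_URL, params=params)
    response.raise_for_status()
    return [doc["name"] for doc in response.json()["response"]["docs"]]

print(suggest("name:bee*"))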

As an example, a simplistic approach to word suggestions in the phone keyboard would be a “starts with” search, yielding terrible results for the following two searches:

name:b*
baba,babbage,babble,babcock,babe
name:bee*
beef,beehive,beep,beeper,beeps

Solr comes with a stemming algorithm that treats words with a common root as equivalent, e.g. “run” and “running.” The word list already includes essentially every grammatical variation of these, so with a “starts with” search we don’t need to turn stemming on.

Improving the previous example to sort by frequency yields nearly the same result as my phone:

{!boost b=count}(name:b*)
been,before,between,back
{!boost b=count}(name:bee*)
been,beer,beef,bee,beethoven

The fourth term in the result, “bee,” is exactly what the user typed. This should be removed to make better use of the limited screen real estate on the phone.

{!boost b=count}(name:bee*) -name:bee
been,beer,beef,beethoven,beech

The problem with this technique is that it assumes that the user typed exactly what they wanted. We can improve this result by storing incorrectly spelled versions of words, and searching against those – if the user mis-types, the search engine can return the original, correct spelling.

To do this, Solr lets you store the same data in multiple formats. For instance, say we add a field called “duplicate,” which collapses repeated letters: “Beer” becomes “Ber,” “Bookkeeper” becomes “Bokeper,” and so on. The same transformation is applied to the user’s input before searching.
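
A sketch of that transformation (here I collapse runs of a repeated letter and lowercase the word before indexing and searching; the exact rule the field uses is an assumption):

import re

def collapse_repeats(word):
    # "Beer" -> "ber", "Bookkeeper" -> "bokeper": any run of the same
    # letter is reduced to a single occurrence.
    return re.sub(r"(.)\1+", r"\1", word.lower())

assert collapse_repeats("Beer") == "ber"
assert collapse_repeats("Bookkeeper") == "bokeper"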

{!boost b=count}(duplicate:bok* OR name:bookk*)
book,books,book's,booked,bookseller

This returns good results, even though the user hit “k” twice.

For a second improvement, I’ve added a field which treats each group of three adjacent keys on the keyboard as equivalent: “q”, “w”, “e”; “r”, “t”, “y”; and so on. Each letter of the word is replaced by the first letter of its group, which yields “vqq” for the word “bee.”
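
A sketch of that mapping (how the leftover “p” and “m” keys are grouped is my own guess):

# Three adjacent keys per QWERTY row collapse to the first key of the
# group, so a finger that lands one key off still produces the same token.
GROUPS = ["qwe", "rty", "uio", "p", "asd", "fgh", "jkl", "zxc", "vbn", "m"]
KEY_MAP = {letter: group[0] for group in GROUPS for letter in group}

def keyboard_key(word):
    # "bee" -> "vqq": b is in the v/b/n group, e is in the q/w/e group.
    return "".join(KEY_MAP.get(ch, ch) for ch in word.lower())

assert keyboard_key("bee") == "vqq"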

{!boost b=count}(typos:vqq*) -name:bee
been,need,news,newspaper

These are all words that might have been intended, if your aim was a bit off. In the following contrived example, the user hit “b” instead of “n”:

{!boost b=count}(
   (name:bew)^2
   OR (typos:vqq* AND -name:bew*)
   OR (duplicate:bew* AND -name:bew*)) -name:bew
been,new,need,news,newspaper

The ^2 provides a boost to correctly typed words, so users will only see corrections when nothing better is found.
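
Putting the pieces together, a query like the one above can be built for any prefix the user has typed. This sketch reuses the collapse_repeats and keyboard_key helpers and the suggest function from the earlier snippets:

def build_query(typed):
    # Exact prefix matches get a 2x boost; the de-duplicated and
    # keyboard-group forms catch typos without repeating exact matches.
    typed = typed.lower()
    dup = collapse_repeats(typed)
    typo = keyboard_key(typed)
    return ("{!boost b=count}("
            f"(name:{typed})^2"
            f" OR (typos:{typo}* AND -name:{typed}*)"
            f" OR (duplicate:{dup}* AND -name:{typed}*)"
            f") -name:{typed}")

print(suggest(build_query("bew")))
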
To make the search sensitive to context, add a separate count for the app in question:

{!boost b=sum(appcount,log(count))}name:bee*

Using log(count) will still return records ordered by frequency, but it allows the frequency of word use within an app to override real-world usage, since log(count) is less than seven or eight for every word in this dataset.

Each added field makes the Solr index larger, but the data is fairly well compressed. The filtered Google n-grams data is 107 MB as text (31 MB zipped), and the Solr index with words, counts, and both typo fields is 80 MB. With further cleaning and optimization, much more data could be included.