Software development

Data structures for fast autocomplete

Topics
search, autocomplete, algorithms, data structures

Autocomplete is a feature that suggests a complete word or phrase after a user has typed just a few letters. The feature increases text input speed especially on mobile devices, because one doesn't have to type every letter in a word. Autocomplete functionality is commonly found on search engines and messaging apps.

A good autocompleter must be fast and update the list of suggestions immediately after the user types the next letter. This blog post studies algorithms and data structures that are necessary for attaining a satisfactory speed.

An autocompleter can only suggest words that it knows about. The suggested words come from a predefined vocabulary, for example all distinct words in Wikipedia or in the Oxford English Dictionary. The number of words in the vocabulary can be large: Wikipedia contains millions of distinct words. The vocabulary grows even larger if the autocompleter is required to recognize multi-word phrases and entity names.

The heart of an autocompleter is a function that takes in the beginning of a word, or a prefix, and searches for vocabulary words that start with the given prefix. Typically, only a small number (about half a dozen) of possible completions are returned. There are many ways to implement such a function. Let's take a look at a few of the possibilities. We start from a straightforward but slow implementation and build progressively more efficient methods on top of that.

Alphabetical vocabulary

A naive implementation would iterate sequentially over the vocabulary words checking each in turn to see if it starts with the given prefix. However, this approach would need to compare the prefix against half of the vocabulary words on average. This approach is too slow for all but the smallest vocabularies.

A better implementation keeps the vocabulary words sorted in alphabetical order. Searching for a prefix in an ordered vocabulary is pretty fast with the help of a binary search algorithm. The binary search compares a query prefix against a middle element in the list to see if the prefix comes before or after the middle element. The binary search is then repeated recursively on the correct half of the list while the other half is ignored. Because every step of the binary search halves the range still to be searched, the total search time is proportional to the logarithm of the number of words in the vocabulary (as explained in this Stackoverflow answer).

Searching for words starting with broc-. Binary search concludes that words must belong to the top portion of the list by comparing the middle element against the prefix broc-.

Binary search is usually used to find complete words but it works also for prefixes because in an alphabetical list all prefixes are also in order (that is, all words beginning with broc- come before word beginning with brog-).

An ordered list of words is an especially attractive data structure when implementing an autocomplete on a search engine because the search engine already maintains such a list as part of its inverted index.

Binary search performs pretty well, but it turns out we can do even better.

Prefix tree

Typically many vocabulary words begin with the same prefix (broccoli, broker, brother and many other words all start with bro-). It seems wasteful to have to store the common prefix separately for each word.

A prefix tree is a data structure that exploits the shared prefixes to speed up the completion. Prefix tree arranges a set of words in a tree of nodes. The words are stored along paths from the root to leaf nodes. The edges corresponds to the letters, and the level on the tree corresponds to the letter position of a prefix. The first letter of the word defines which edge from the root to follow. The second letter defines which edge to take from the child node, and so on.

A prefix tree for words need, nested, seed and speed.

The completions of a prefix are found by first following the path defined by the letters of the prefix. This will end up in some inner node. For example, in the pictured prefix tree, the prefix ne- corresponds to the path of taking the left edge (labeled with N) from the root and the sole edge (labeled with E) from the child node. The completions can then be generated by continuing the traversal to all leaf nodes that can be reached from the inner node. In the picture, the completions of ne- are the two branches reachable from the inner node: -ed and -sted. If a path defined by the prefix is not found in the tree, the vocabulary does not contain words beginning with that prefix.

Searching in a prefix tree is extremely fast. The number of needed comparison steps to find a prefix is the same as the number of letters in the prefix. Particularly, this means that the search time is independent of the vocabulary size. Therefore, prefix trees are suitable even for large vocabularies. Prefix trees provide substantial speed improvements over ordered lists (where the search time is proportional to the logarithm of the vocabulary size, as noted above). The improvement is realized because each comparison is able to prune a much larger fraction of the search space (all continuations except those related to single letter).

Minimal deterministic finite automaton

Prefix trees handle common prefixes efficiently, but other shared word parts are still stored separately in each branch. For example, suffixes, such as -ing and -ion, are common in the English language. Luckily, there is a way to store shared word parts more economically.

A prefix tree is an instance of a class of more general data structures called acyclic deterministic finite automata (DFA). There are well-established algorithms for transforming a DFA into an equivalent DFA that has as few nodes as possible.

The minimal DFA for words need, nested, seed and speed consists of only 9 nodes compared to 17 nodes in the original prefix tree in the previous picture.

Minimizing a prefix tree DFA reduces the size of the data structure. A minimal DFA typically fits in the memory even when the vocabulary is large. Avoiding expensive disk accesses is key to a lightning fast autocomplete.

Extensions

The above describes how to implement a basic autocomplete functionality. These data structures can be extended in many ways to improve the user experience.

Often there are more possible completions for a particular prefix than what can be presented on the user interface. We might prefer to show the most commonly selected or otherwise most valuable completions first. This can be achieved by including an integer weight, which represents the value of a word, for each word in the vocabulary, and by showing the completions with the highest weight first. (For sorted lists and prefix trees it is trivial to store a weight in the leaf nodes. DFAs are more complicated, because a leaf node can be reached through several paths. A solution is to use a finite state transducer that associates a weight to a path instead of a leaf node.)

Prefix tree based completions can be extended to tolerate small spelling errors by following not only the exact path defined by a given prefix but also some neighboring paths with minor spelling variations.

A fixed vocabulary might be too limiting in some applications. Words entered by the user are prime candidates for completions in the future. To that end, the entered words can be inserted into the autocomplete data structure. Modifying the full data structure might be too expensive, however. A solution is to keep two data structures: one with a fixed global vocabulary and another with that user's words. The latter is much smaller and therefore faster to update.

While the autocomplete data structures are interesting, it is not necessary to implement them yourself. There are open source libraries that provide the functionality. For example, Elasticsearch and Solr and other search engines based on Lucene provide an efficient and robust autocomplete functionality.