A trie is a tree-like data structure. The main difference between tries and BSTs is that tries have an arbitrary branching factor, whereas BST nodes can have at most two children. This allows faster lookups: a trie lookup is proportional to the length of the key, whereas a BST lookup is proportional to the number of items stored in the tree. Another advantage of tries is that they allow prefix-based lookups. For this reason, they are used in text processing for tasks such as word autocompletion.
Interestingly, a trie can also be used as an alternative to a hash table when implementing associative arrays. It has several advantages over a hash table: it does not require a hash function, and is thus faster for shorter keys; it maintains order within items that it stores. This alternative is particularly useful when many of the associative array keys are prefixes of other keys (for example, when the keys are English words).
For this week's coding practice, I coded up an autocompleter in Python. The autocompleter works in two steps: in the first step, the script initializes a trie by looks at a dictionary of English words (on Linux systems, one can often be found at /usr/share/dict/american-english). In the second step, the script prompts the user to input a prefix, and uses the trie to locate all words beginning with that prefix. The initialization step takes a little while (approximately 2 seconds for 100,000 words), but only needs to be done once. The lookup step is pretty much instantaneous: as long as the trie is in memory, the cost of lookups is linear to the length of the key.
The entire american-english file consists of 100K words and is approximately 1MiB. The trie used to hold this dictionary has over 200K nodes, and occupies a whopping 85MiB in memory. That's approximately 425 bytes per node: it's likely that this can be reduced. Personally, I suspect that an implementation in C would use less memory.
Here is the code:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import sys | |
# | |
# Lessons learned: | |
# | |
# - must override object class to get __sizeof__ to work | |
# | |
class TrieNode(object): | |
def __init__(self, leaf): | |
self.children = dict() | |
self.leaf = leaf | |
def __sizeof__(self): | |
return object.__sizeof__(self) + sys.getsizeof(self.children) + sys.getsizeof(self.leaf) | |
def insert_word(trie, word): | |
if not word: | |
return | |
try: | |
child = trie.children[word[0]] | |
except KeyError: | |
child = TrieNode(len(word) == 1) | |
trie.children[word[0]] = child | |
insert_word(child, word[1:]) | |
def find_words(trie, prefix): | |
for c in prefix: | |
try: | |
trie = trie.children[c] | |
except KeyError: | |
return list() | |
root = (trie, prefix) | |
stack = [ root ] | |
words = list() | |
while stack: | |
node, cprefix = stack.pop() | |
if node.leaf: | |
words.append(cprefix) | |
for l in node.children: | |
child = (node.children[l], cprefix+l) | |
stack.append(child) | |
return words | |
def size_of_trie(root): | |
stack = [ root ] | |
bytes = 0 | |
nodes = 0 | |
while stack: | |
node = stack.pop() | |
nodes += 1 | |
bytes += sys.getsizeof(node) | |
for l in node.children: | |
stack.append(node.children[l]) | |
return bytes, nodes | |
if __name__ == "__main__": | |
if len(sys.argv) != 2: | |
print "usage: python %s words.txt" % __file__ | |
sys.exit(1) | |
with open(sys.argv[1]) as fin: | |
words = fin.read().strip().split("\n") | |
root = TrieNode(False) | |
for w in words: | |
insert_word(root, w) | |
bytes, nodes = size_of_trie(root) | |
print "read %d words (%d nodes, %dMiB). Enter a prefix to search, EOF to exit:" % (len(words), nodes, bytes/1e6) | |
while True: | |
try: | |
prefix = raw_input("> ") | |
print find_words(root, prefix) | |
except EOFError: | |
break |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
foo | |
foobar | |
fool | |
for | |
foot |
No comments:
Post a Comment