Sunday, February 24, 2013

This Week's Coding Practice: Stacks and the Tower of Hanoi

A stack is a common data type that allows its elements to be accessed in LIFO (last in, first out) order. Many parts of the physical world around us function like a stack. A common analogy is a stack of plates on a table: it's simple to put a new plate on top or remove the top-most plate. It's impossible to insert a new plate into the middle of the stack. It's also impossible to remove any plates other than the top-most one. I ran into yet another real-world example of a stack the other day: a narrow rectangular parking lot with only one side facing a driveway. The lot has just enough space to fit around 4-5 cars end-to-end. When people want to park, they just drive straight in. Eventually, they get parked in by someone, who also gets parked in by somebody else, and so on. Lucky last doesn't get parked in by anyone else, but gets nagged by all the people he parked in whenever they want to leave.

Another popular example of a stack is the famous Tower of Hanoi puzzle.  There are three rods, and an arbitrary number of disks of unique sizes.  Initially, all the disks are placed on the first rod, ordered by size (smallest disk at the top).  The goal of the puzzle is to move all the disks to another rod, one by one, without ever putting a larger disk on top of a smaller one.  The task for this week was to write an algorithm that solves this puzzle.

It may not be obvious from the description, but each peg can be modeled as a stack, since its physically impossible to access elements in any order other than LIFO (for example, randomly, or FIFO).  The entire puzzle is then modeled by 3 separate stacks.

How does one go about solving the puzzle?  It turns out that it has been thoroughly studied, and several well-known recursive and iterative algorithms exist.  However, since I have a tendency to do things the hard way, I didn't study those algorithms prior to solving the puzzle, and came up with a less elegant but home-brewed solution on my own.

Having spent a significant amount of time on the problems at Project Euler, I've acquired a instinctive reaction: whenever facing a problem for the first time, try brute force.  It's almost never the best (or even borderline satisfactory) solution, but is a fairly quick way of getting acquainted with a problem.  It's also a good starting point for trying out other things.  Lastly, it's better than nothing.

With that in mind, I approached the puzzle as a search problem (a more general instance of the binary search).  At each step, you represent the current state of the puzzle as 3 individual states.  To get to the next steps, a disks is moved from one stack to another, without violating the rules of the puzzle.  Once you get to a state in which all of the discs are on the second (or third) stack, then you've solved the puzzle.  In which order should we look at the next steps?  One simple way is DFS (depth-first search).

I quickly realized that a lot of the resulting steps weren't helpful in solving the puzzle: moving a single disc backwards and forwards between two rods achieves nothing.  It also causes some branches of the DFS to be infinitely long, meaning the search will fail to terminate.  One way to get around this problem is to search only those states that haven't been encountered yet.  This requires keeping track of the states encountered thus far, which could be done with a hash map.

As it turns out, DFS isn't the best way to perform the search, since the solution it yields isn't necessarily the simplest.  This can be addressed by switching to breadth-first search (BFS): the solution it yields will have the minimum possible number of steps (which, as it turns out, is equal to $2^{n-1}$).

After writing my solution out on paper, I quickly coded it up in Python.  Since I had a bit of spare time, I decided to try out my JavaScript chops and implement something I could show off graphically.  While fiddling with JS took me way longer than coming up with the actual algorithm, I got there in the end, and came up with this:

The search problem can be taken further by implementing an A* search that uses a heuristic to determine the best action to take at each step.  If a suitable heuristic can be formulated, this kind of search can significantly reduce the running time and memory usage of the algorithm.  It was tempting to keep digging, but I'll leave that to another day.

Full source (including the Python code) is available as a gist.

Friday, February 15, 2013

Coding Practice: Tries

I was 16 when I first got my hands on a mobile phone -- it was the year of the Sydney Olympics, and since I was a volunteer, I was spending a lot of time outside and needed a way to stay in touch with home base. I'm not 100% positive, but I think it was an Ericsson SH888. Calls weren't cheap back then, so SMS was the way to go. I remember the pain of having to enter English text through a numeric keypad without using a predictive text engine like T9. Of course, these days, predictive text (also known as auto-completion) is everywhere: mobile devices, web browsers, and office software. In today's article, I'll briefly discuss a data structure that can be used to implement auto-completion effectively and efficiently -- the trie.

A trie is a tree-like data structure. The main difference between tries and BSTs is that tries have an arbitrary branching factor, whereas BST nodes can have at most two children. This allows faster lookups: a trie lookup is proportional to the length of the key, whereas a BST lookup is proportional to the number of items stored in the tree. Another advantage of tries is that they allow prefix-based lookups. For this reason, they are used in text processing for tasks such as word autocompletion.

Interestingly, a trie can also be used as an alternative to a hash table when implementing associative arrays. It has several advantages over a hash table: it does not require a hash function, and is thus faster for shorter keys; it maintains order within items that it stores. This alternative is particularly useful when many of the associative array keys are prefixes of other keys (for example, when the keys are English words).

For this week's coding practice, I coded up an autocompleter in Python. The autocompleter works in two steps: in the first step, the script initializes a trie by looks at a dictionary of English words (on Linux systems, one can often be found at /usr/share/dict/american-english). In the second step, the script prompts the user to input a prefix, and uses the trie to locate all words beginning with that prefix. The initialization step takes a little while (approximately 2 seconds for 100,000 words), but only needs to be done once. The lookup step is pretty much instantaneous: as long as the trie is in memory, the cost of lookups is linear to the length of the key.

The entire american-english file consists of 100K words and is approximately 1MiB. The trie used to hold this dictionary has over 200K nodes, and occupies a whopping 85MiB in memory. That's approximately 425 bytes per node: it's likely that this can be reduced. Personally, I suspect that an implementation in C would use less memory.

Here is the code:

Finally, many a battle have been fought over how to pronounce the word "trie". One camp insists on pronouncing it the same as "tree", since the origins of the word are from "retrieval". Since a "tree" and a "trie" are slightly different things, another camp insists on pronouncing it as "try". Finally, others sidestep the issue completely and refer to tries as "prefix trees".

Wednesday, February 6, 2013

The Challenges of Character Encoding... in 2013

Back when I was a lad, Unicode wasn't as popular as it is now. The world lived in the Dark Ages, where each text character was usually represented by a single byte, regardless of the language. A single byte can take up to 256 different values, meaning that on its own, it could be used to represent at most 256 different characters. People wanted to communicate to each other in different languages, but the number of characters in all modern languages was far more than 256.  This posed a problem.

A commonly accepted solution was to agree that the first 128 character codes were fixed (the ASCII character set), while the last 128 were unspecified -- their values depended on the context. A swarm of character encodings emerged, each handling these last 128 characters differently. For my native Russian, there were at least two such encodings: KOI-8 and CP1251. When you got text in Russian, you typically had to guess which encoding it was in. If you got it wrong, all you saw was gibberish (and any English text, since that was handled the same regardless of the encoding). It was 50/50, so not too bad. There were also some utilities that helped you guess the encoding.

A significant limitation, of course, was that you couldn't freely mix languages in text. For example, working with Russian and Greek text in the same file was not possible. Greek used a different encoding, which wasn't compatible with KOI-8 nor CP1251. You could see Russian or Greek at any one time, but never both.

Fast-forward to 2013. Unicode has become almost universally accepted. It solved the problems above, introduced several new ones, but overall, made the world a better place. The days of staring at gibberish trying to work out which encoding it is in were gone. Things just worked. Or so I thought.

I recently had the pleasure of playing correspondence chess with my dad using a chess server. While the game itself was quite entertaining (I lost miserably), I quickly noticed that sending Unicode messages wasn't working -- ASCII characters got through, non-ASCII characters got replaced by a question mark. This forced us to write in transliterated Russian, which I hate with a passion:

What was obvious that while the Web site was capable of displaying Cyrillic (which I managed to enter using HTML character escapes, thanks to the site admin), it was ruthlessly clobbering the text after it was being submitted. After looking at it more closely, I realized the input textarea was part of a POST form, and clicking on the "Msg" button submitted the form. The message was, therefore, part of the POST payload. I confirmed that it was successfully URL-encoded and still readable by using Chrome's Developer tools. After that, all trace of the message is lost, but one thing is clear: the non-ASCII characters in the message died a horrible death.

While encountering a fairly popular site that could not properly handle Unicode in 2013 was amusing, what was even more amusing was my dialogue with the administrator of the site. While I do believe they were genuinely trying to help, their firm belief that a server-side encoding problem could be fixed by modifying the client configuration was disconcerting, to say the least. The more I talked to them, the more I understood that they had no idea how character encoding works. Unfortunately, they also had little desire to reason about it, which led to a rather tense dialogue:
... the language that you use to handle message input is controlled by your local configuration and this is something over which we have no control. All of our pages, in common with around 75% of all online content from many diverse sites, are configured to accept input that conforms with UTF-8 standards. As I have tried to point out, there may well be a local issue relevant to your local configuration, codepage setting, default character coding and so on. These are not issues for which we, as a UK based Chess playing site, normally provide support. In this case, we have offered more support and advice than would have been provided by many of our competitors.
I edited the quote above liberally, for brevity. It can, however, be summarized by a single word in the King's English: bollocks. What's really happening is this: after the form is submitted, the URL-encoded message is retrieved from the POST payload, and encoded into one of the antique character encodings from the Dark Ages. Since these encodings only represent characters with codes up to 255, and the Unicode Cyrillic characters sit around the 1000 mark, there's simply no way to represent those characters anymore. Whatever performs the encoding replaces such characters with a placeholder character, which in this case happens to be a question mark.

I mashed up some JavaScript to demonstrate the problem.  While I couldn't be bothered doing a full form POST, the code demonstrates the essence of the problem.  It's below.

Here's a live version you can interact with.  Case closed.

clobber the submitted text

Had I been a more patient man, I would have persevered against the onslaught of ignorance and carried on my crusade for working Unicode input to its glorious end. Instead, I just opened an account on another chess server.

Tuesday, February 5, 2013

Coding Practice: Depth-first Tree Traversal

A few weeks ago, I wrote about trees -- binary search trees, to be specific.  Working with a tree, for example inserting or retrieving elements, often involve scanning through the elements of the tree, starting at the root.  This is known as a traversal.

Unsurprisingly, there are many ways to perform a traversal.  A major distinguishing factor is the order in which nodes are visited: if all children of a node are visited before any of the node's siblings, then the traversal is known as depth-first.  If, on the other hand, the siblings are visited before the children, then the traversal is known as breadth-first.  In this article, I'll focus on describing depth-first traversals.

There are three main ways of traversing a binary tree depth-first: pre-order, in-order and post-order. They are typically defined recursively, with each step of the recursion consisting of three sub-steps: do something to the current node (this is referred to as visiting), traverse the left subtree, traverse the right subtree. By convention, the left subtree is traversed before the right subtree. Pre-order, in-order and post-order traversals perform the visit sub-step before, in-between and after the two subtree traversals, respectively.  Here's an example of performing each of the three traversal on a small tree.

The big three traversal algorithms can be implemented recursively or iteratively. Recursive implementations are easier to understand since they follow directly from definition, but can be less efficient than iterative implementations due to the function call overhead (for more details, see http://stackoverflow.com/questions/72209/recursion-or-iteration).

Picking the traversal algorithm to use depends on the application. For example, propagating changes from the leaf nodes to the root (e.g. calculating the sum of the tree) can only be accomplished with a post-order traversal, since the sum of each subtree needs to be known before the sum of the current node can be calculated. In contrast, propagating changes from the root to the leaves would be best done with a pre-order traversal.

The problems to solve this week were:
1. Implement a simple binary (non-search) tree node data structure in your favorite programming language and write the following methods: (1) print nodes pre-order, (2) print nodes in-order, (3) print nodes post-order.
2. Write a function that, given two nodes and a tree root, finds the two nodes' lowest common ancestor. That is, the function should find the ancestor that both nodes share that is furthest away from the root.
I went with a C++ implementation this time to take advantage of the STL's sets and maps. For finding the lowest common ancestor (LCA), I didn't use a parent pointer in the Node, and instead used an arbitrary traversal to calculate the parent of each node in the tree and store it in a map. This costs $O(n)$ for both space and time. Once that's done, finding all ancestors of one node and searching for the LCA both cost $O(log(n))$ given a data structure with a fast membership function (a set). The approach is thus $O(n)$, but can be reduced to $O(log(n))$ if the results of the traversal are pre-calculated and stored somewhere.  This pre-calculation would be practically identical to keeping parent pointers in each Node.

The code is below: