Thursday, December 27, 2012

This Week's Coding Practice: Merge Sort

While sorting algorithms are very important from a programming point of view, they also frequently occur in our daily practical lives.  For example, each week, several students in our lab distribute notes that they will present at the weekly seminar.  Because the students distribute their notes independently, the pages invariably end up out of order.  How do you put them back in order?  It's a sorting problem.

One way would be to put the notes in a new pile, represented by a linked list (a simple array is also possible).  You'd insert the notes into the pile one by one, making sure each note goes in the appropriate place and does not upset the pre-existing order.  What's the overhead for doing things this way?  Well, you have to search the pile and insert the new note, which takes $O(n)$ if we don't do anything clever.  Since you'll be doing that for all n items, the total cost of sorting the notes is $O(n^2)$, which isn't great.  In the world of programming, this is known as an insertion sort.  You can improve each insertion time by using a binary search tree to represent the pile.  This is known as a tree sort, and it decreases the search and total sort times to $O(\log(n))$ and $O(n \log(n))$ on average, respectively.  An auto-balancing BST works best, since in that case even the worst-case sort time is $O(n \log(n))$.
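To make that concrete, here's a rough Python sketch of the insertion-sort approach (the pile is just a Python list, so finding the right spot is a plain linear scan):

```python
def insertion_sort(notes):
    """Sort notes by inserting each one into its proper place in a new pile."""
    pile = []
    for note in notes:
        # Linear scan to find the first note that should come after this one.
        i = 0
        while i < len(pile) and pile[i] <= note:
            i += 1
        pile.insert(i, note)  # O(n) search + insert, done n times: O(n^2) overall
    return pile

print(insertion_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```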

If you want $O(n \log(n))$ sort performance, but don't feel like mentally implementing an auto-balancing BST to sort a couple of notes, there are some alternatives.  My personal favorite is merge sort, which, incidentally, is this week's topic.  Merge sort is a divide-and-conquer algorithm that is best defined recursively: divide the items into two equally-sized piles, recursively sort each pile, then merge the sorted piles back together.  The recursion terminates when a pile cannot be divided any further (i.e. contains fewer than two elements).  What's the computational complexity of merge sort?  Since you halve the input at each step, the total depth of the recursion is $\lceil \log_2(n) \rceil$.  At each level of the recursion you also have to merge sorted piles back together -- taken across a whole level, this can be done in linear time.  The overall sort thus takes $O(n \log(n))$ (since we don't care about the base of the logarithm).

This week's problem was to implement the merge step of the merge sort.  To make things more interesting, we're merging an arbitrary number of piles of arbitrary size, instead of two piles of roughly the same size.  Here it is, in Python (for a change):
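A minimal sketch of one way such a merge could go, assuming the piles arrive as a list of already-sorted Python lists (a heap keeps track of the smallest remaining head element across all piles):

```python
import heapq

def merge_piles(piles):
    """Merge an arbitrary number of sorted piles into one sorted list.

    Each of the n total elements is pushed and popped once, so the
    merge costs O(n log k) for k piles.
    """
    merged = []
    heap = []
    # Seed the heap with the first element of every non-empty pile.
    for pile_index, pile in enumerate(piles):
        if pile:
            heapq.heappush(heap, (pile[0], pile_index, 0))
    while heap:
        value, pile_index, element_index = heapq.heappop(heap)
        merged.append(value)
        next_index = element_index + 1
        if next_index < len(piles[pile_index]):
            heapq.heappush(heap, (piles[pile_index][next_index], pile_index, next_index))
    return merged

print(merge_piles([[1, 4, 7], [2, 5], [], [0, 3, 6, 8]]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8]
```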

Since I just couldn't help myself, I went ahead and implemented the entire merge sort in Python.  It doesn't share any code with the merge step mentioned above, but the principles are the same.
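In outline, a plain two-pile merge sort looks something like this (again just a sketch; the merge uses a temporary list, as mentioned below):

```python
def merge_sort(items):
    """Recursively sort a list: split it in half, sort each half, merge."""
    if len(items) < 2:  # a pile that can't be divided further is already sorted
        return items
    middle = len(items) // 2
    left = merge_sort(items[:middle])
    right = merge_sort(items[middle:])
    return merge(left, right)

def merge(left, right):
    """Merge two sorted lists into one sorted list in linear time."""
    merged = []  # temporary storage for the merged result
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these still has elements
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 3, 8, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]
```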

I'm using a temporary list to perform the merging.  It's possible to perform an in-place merge sort without requiring such temporary storage, but I haven't gotten around to it yet.

Friday, December 21, 2012

This Week's Coding Practice: Binary Search Trees

This week's topic, as you may have guessed already, was Binary Search Trees (BST).  A Binary Tree (BT) is a hierarchical structure in which each node has one parent (with the exception of the root node, which is an orphan) and at most two children.  Nodes without children are called leaf nodes.  A BST is a special kind of BT that satisfies the following invariant: an in-order traversal of the tree yields the nodes' values in ascending order.  An auto-balancing BST is a BST that also satisfies another invariant: the height of the tree (the maximum number of links from the root to a leaf node) needs to be relatively small.
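For the sketches below, a node can be represented with something as simple as this (assuming each node stores its value directly):

```python
class Node:
    """A node in a binary (search) tree."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left    # in a BST: subtree holding smaller values
        self.right = right  # in a BST: subtree holding larger values
```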

The first invariant allows the BST to be searched efficiently: you start at the root, and at each node you compare the value you're searching for with the value of the node.  If they are equal, you've found your node and you're all done.  If the former is less than the latter, you continue down to the left child of the node; otherwise, you go down to the right child.  If you can't continue because a particular child doesn't exist, then the BST doesn't contain the value you're looking for.  How long does this retrieval operation take?  Well, at each step you go down one level, so you will need at most h steps, where h is the height of the tree.
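Using the Node class above, the retrieval can be sketched in a few lines:

```python
def bst_search(node, target):
    """Return the node holding `target`, or None if the BST doesn't contain it."""
    while node is not None:
        if target == node.value:
            return node           # found it
        elif target < node.value:
            node = node.left      # smaller values live down the left child
        else:
            node = node.right     # larger values live down the right child
    return None                   # ran off the tree: the value isn't there
```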

This is where the second invariant comes in.  Consider this example: an ordered list of n numbers is added, one by one, to a BST that does not satisfy the second invariant.  What will the resulting tree look like after the nth addition?  It will resemble a linked list n elements long, and searching through it will take $O(n)$ time.  The benefit of using such a BST over a simple linked list would be lost.  If, on the other hand, the BST "magically" kept its height relatively small, then you would be able to perform much more effective lookups.  How small is "relatively small"?  Well, count the nodes in a completely filled BT with h levels: there is 1 node at the top (the root), 2 nodes on the second level (children of the root), 4 nodes on the third level (grandchildren of the root), and so on.  This is a geometric progression with a common ratio of 2 and an initial value of 1, which sums to $n = 2^h - 1$.  Inverting, the relationship between h and n is $h = \lceil \log_2(n+1) \rceil$, where the ceiling accounts for the fact that, in general, not all non-leaf nodes will have two children.  This is the optimal height of a BST with n nodes.  In practice, keeping the height of the tree at precisely this value is overkill, so the invariant is considered satisfied if the height is within some constant factor of this optimal value.  The benefit of satisfying these two invariants is the assurance that retrievals will take at most $O(\log(n))$ steps, where n is the number of nodes in the tree.  The above condition also implies that the left and right subtrees of every node must be of similar heights (i.e. balanced), which is why such BSTs are called auto-balancing.
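To put a concrete number on it: $\lceil \log_2(10^6 + 1) \rceil = 20$, so even a tree with a million nodes can be searched in at most twenty steps, provided it stays balanced.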

One main question remains: how does an auto-balancing BST maintain its height to satisfy the second invariant?  The answer depends on the implementation, of which there are several.  I'll discuss this in more detail in the next post.  For now, I leave you with the solution for this week's problem: how do you check that a BT is an auto-balancing BST?
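Here is a sketch of one way to approach it, reusing the Node class from above: verify the ordering invariant with an in-order traversal, and verify the height invariant by requiring (as AVL trees do) that the two subtree heights of every node differ by at most one.  Other balance criteria are possible; this is just the simplest one to state.

```python
def is_balanced_bst(root):
    """Check both invariants: in-order values ascend, and every node is balanced."""

    def in_order_ascending(node, previous):
        # `previous` is a one-element list holding the last value visited in-order.
        if node is None:
            return True
        if not in_order_ascending(node.left, previous):
            return False
        if previous[0] is not None and node.value < previous[0]:
            return False  # out of order: not a BST
        previous[0] = node.value
        return in_order_ascending(node.right, previous)

    def balanced_height(node):
        # Returns the height of the subtree, or None if it is unbalanced anywhere.
        if node is None:
            return -1  # an empty subtree has height -1, so a leaf has height 0
        left = balanced_height(node.left)
        right = balanced_height(node.right)
        if left is None or right is None or abs(left - right) > 1:
            return None
        return 1 + max(left, right)

    return in_order_ascending(root, [None]) and balanced_height(root) is not None

# A tiny balanced BST:  2
#                      / \
#                     1   3
print(is_balanced_bst(Node(2, Node(1), Node(3))))  # True
```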

As an aside, yet another hooray for StackExchange, which got me up and running with LaTeX on Blogger in no time.

Thursday, December 20, 2012

Last Week's Coding Practice: Hash Tables

In an effort to keep myself in tip-top coding shape, I've started doing the weekly coding problems at http://codingforinterviews.com/. Each week, they set a topic, ask you a few questions and send you a simple problem to solve. They don't really check your answers or solution, but the main point is to get you thinking about important data structures, algorithms, etc. This is my second week in, and so far it seems worthwhile. Today I'd like to briefly mention last week's topic: Hash Tables, and document some of the things that I've learnt.

To me, hash tables have been the bread and butter of programming ever since I learnt about them. Constant-time lookups and insertion, linear storage overhead, complete freedom to choose the key and value types... what's not to like, right? Unfortunately, while I was using hash tables practically everywhere, I neglected to learn how they are actually implemented. Last week, that changed.

Back to School

First, I had to make some adjustments to my vocabulary. A hash table (or hashtable, hashmap, hash map, ...) is one way of implementing an associative array, which is a mapping between keys and values. Other implementations of associative arrays are possible: for example, you could do it using a linked list of (key, value) pairs, but it wouldn't be a particularly good implementation since insertion and retrieval would be O(n).
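For illustration, here's what that naive implementation might look like (a plain Python list of pairs standing in for the linked list):

```python
class ListMap:
    """A deliberately naive associative array: a list of (key, value) pairs."""

    def __init__(self):
        self.pairs = []

    def put(self, key, value):
        # O(n): scan the whole list to see whether the key is already present.
        for i, (existing_key, _) in enumerate(self.pairs):
            if existing_key == key:
                self.pairs[i] = (key, value)
                return
        self.pairs.append((key, value))

    def get(self, key):
        # O(n): scan until the key turns up (or doesn't).
        for existing_key, value in self.pairs:
            if existing_key == key:
                return value
        raise KeyError(key)
```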

Next, a hash table consists of a hash function and a fixed-length array. The hash function takes a key, which can be of any data type, and returns an integer (the hash value). This hash function is deterministic (the same key always gets hashed to the same integer). Additionally, one of the most desirable properties of a hash function is that different keys should rarely end up with the same hash value -- ideally, the hashes are spread evenly over the integers. When two different keys do hash to the same value, it is known as a collision. For good hash functions, collisions are rare in practice. The hash value is then mod'ed (for lack of a better word) by the length of the fixed-length array, and the result is used to index into the array.

In a trivial case, each element (bin) of the array can simply hold the value that corresponds to a specific key. Due to collisions (and the mod operation), however, different keys can end up going into the same bin. While collisions are rare, they do occur, and a strategy to handle or reliably avoid them becomes necessary.

There are several ways to deal with a collision. I'll describe one of the simplest: each bin contains a linked list that holds (key, value) pairs. When inserting a (key, value) pair into the hash table, you just append the pair to the linked list of that particular bin. When retrieving a value for a particular key, you traverse the list in the right bin until you hit the pair with the matching key. Appending to a list and traversing it add overhead to hash table insertion and retrieval. However, this overhead is linear in the length of the linked list, and since collisions are rare, the linked lists will be fairly short, so the overhead is minimal. Other collision handling strategies reduce this overhead further, but are slightly more involved.

Lastly, while hash functions sound like magic, in essence they are quite simple: convert the key into a numeric representation of arbitrary length, and then reduce it to an integer. For numbers or numeric-like data types such as characters, you don't really need to do anything -- the hash is simply equal to the key. For compound data types (arrays, structs, strings), you can obtain the numeric representation by adding up the individual elements. Addition alone doesn't evenly distribute the hash value over the entire range of the hash function (the integers), so add-and-multiply is often used instead.
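For example, a simple add-and-multiply hash for strings multiplies a running total by a small constant before adding each character (31 is a popular choice, but the constant itself is somewhat arbitrary):

```python
def string_hash(key, multiplier=31):
    """Add-and-multiply hash: spreads values out better than plain addition."""
    hash_value = 0
    for character in key:
        hash_value = hash_value * multiplier + ord(character)
    return hash_value

print(string_hash("abc"))  # 96354, the same value every time (deterministic)
```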

Email me teh codez?

After learning all the stuff above, I was ready to have a try at implementing my own hash table. Here it is, in C, complete with Makefile and all:
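As a rough illustration of the overall design -- a fixed-length array of bins, with chained (key, value) pairs to handle collisions -- a small Python sketch might look like this (not the C version itself, and the bin count of 16 is an arbitrary choice):

```python
class HashTable:
    """A toy hash table: fixed-length bucket array, chaining for collisions."""

    def __init__(self, num_bins=16):
        # Each bin holds a list of (key, value) pairs (the "chain").
        self.bins = [[] for _ in range(num_bins)]

    def _index(self, key):
        # Hash the key, then mod by the array length to get a bin index.
        return hash(key) % len(self.bins)

    def put(self, key, value):
        chain = self.bins[self._index(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)   # overwrite an existing key
                return
        chain.append((key, value))        # otherwise append to the chain

    def get(self, key):
        chain = self.bins[self._index(key)]
        for existing_key, value in chain:
            if existing_key == key:
                return value
        raise KeyError(key)

table = HashTable()
table.put("apples", 3)
table.put("pears", 5)
print(table.get("apples"))  # 3
```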