To me, hash tables have been the bread and butter of programming ever since learning about them. Constant-time lookups and insertion, linear storage overhead, complete freedom to choose the key and value types... what's there not to like, right? Unfortunately, while I was using hash tables practically everywhere, I neglected to learn about how they are actually implemented. Last week, that changed.
Back to School
First, I had to make some adjustments to my vocabulary. A hash table (or hashtable, hashmap, hash map, ...) is one way of implementing an associative array, which is a mapping a mapping between keys and values. Other implementations of associative arrays are possible: for example, you could do it using a linked list of (key, value) pairs, but it wouldn't be a particularly good implementation since insertion and retrieval would be O(n).
Next, a hash table consists of a hash function and a fixed-length array. The hash function takes a key, which can be of any data type, and returns an integer (the hash value). This hash function is deterministic (the same key always gets hashed to the same integer). Additionally, one of the most desirable properties of a hash function is that it returns unique hashes for unique keys -- different keys never get the same hash value. When two different keys hash to the same value, it is known as a collision. For good hash functions, collisions are rare in practice. The hash value is mod'ed (for lack of a better word) by the length of the fixed-length array, and the result is used to index into the array.
In a trivial case, each element (bin) of the array can simply hold the value that corresponds to a specific key. Due to collisions (and the mod operation), however, different keys can end up going into the same bin. While collisions are rare, they do occur, and a strategy to handle or reliably avoid them becomes necessary.
There are several ways to deal with a collision. I'll describe one of the simplest: each bin contains a linked list that holds (key, value) pairs. When inserting a (key, value) pair into the hash table you just append the pair to the linked list of that particular bin. When retrieving a value for a particular key, you traverse the entire list for the right bin until you hit the pair with the identical key. Appending a list and traversing it adds overhead to the hash table insertion and retrieval. However, this overhead is O(n) of the length of the linked list. Since collisions are rare, the linked lists will be fairly short, and this overhead will be minimal. Other collision handling strategies reduce this overhead, but are slightly more involved.
Lastly, while hash functions sound like magic, in essence they are quite simple: convert the key into a numeric representation of arbitrary length, and then reduce it to an integer. For numbers or numeric-like data types such as characters, you don't really need to do anything -- the hash is simply equal to the key. For compound data types (arrays, structs, strings), you can obtain the numeric representation by adding the individual elements. Addition alone doesn't evenly distribute the hash value over the entire domain of the hash function (integers), so add-and-multiply is often used instead.
Email me teh codez?
After learning all the stuff above, I was ready to have a try at implementing my own hash table. Here it is, in C, complete with Makefile and all:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#include <stdlib.h> | |
#include <stdint.h> | |
#include <assert.h> | |
#include <string.h> | |
#include <stdio.h> | |
/* | |
* Lessons learnt: | |
* - need semicolon after struct declaration | |
* - calloc accepts size_t and number of elements as separate parameters | |
* - check levels of indirections in Hashmap struct | |
* - strlen doesn't include the null termination character | |
*/ | |
/* | |
* A bin represents a unique (key, value) pair stored in a single bin of | |
* a Hashmap. Since multiple pairs can be assigned to a single bin, they are | |
* connected by a linked list structure. | |
*/ | |
struct Bin | |
{ | |
char *key; | |
char *value; | |
struct Bin *next; | |
}; | |
/* | |
* A Hashmap is a linked list of bins. The capacity determines the maximum | |
* value of the key. | |
*/ | |
struct Hashmap | |
{ | |
size_t capacity; | |
struct Bin *bins; | |
}; | |
/* | |
* Allocate a new bin on the heap. Will trigger an assertion on | |
* out-of-memory. | |
*/ | |
struct Bin * | |
bin_new(char *key, char *value); | |
/* | |
* Allocate the bins for the specified Hashmap. Assumes the Hashmap is | |
* allocated, but currently uninitialized. | |
*/ | |
void | |
hashmap_init(struct Hashmap *map, size_t capacity); | |
/* | |
* Insert the specified (key, value) pair into the Hashmap. | |
* If an item with the specified key already exists, it will be overwritten. | |
*/ | |
void | |
hashmap_insert(struct Hashmap *map, char *key, char *value); | |
/* | |
* Retrieve the value with the specified key from the Hashmap. If it is not | |
* found, returns 0. If it is found, sets the value. | |
*/ | |
int | |
hashmap_retrieve(struct Hashmap *map, char *key, char **value); | |
/* | |
* Calculate the hash value for a key. | |
*/ | |
size_t | |
hash_key(char *key); | |
struct Bin * | |
bin_new(char *key, char *value) | |
{ | |
struct Bin *p = calloc(sizeof(struct Bin), 1); | |
assert(p); | |
p->key = key; | |
p->value = value; | |
return p; | |
} | |
void | |
hashmap_init(struct Hashmap *map, size_t capacity) | |
{ | |
assert(map); | |
map->bins = calloc(sizeof(struct Bin), capacity); | |
assert(map->bins); | |
map->capacity = capacity; | |
} | |
void | |
hashmap_insert(struct Hashmap *map, char *key, char *value) | |
{ | |
size_t idx; | |
struct Bin *current = NULL; | |
char *heap_value = NULL; | |
char *heap_key = NULL; | |
assert(map); | |
assert(key); | |
assert(value); | |
heap_value = calloc(sizeof(char), strlen(value) + 1); | |
assert(heap_value); | |
strcpy(heap_value, value); | |
idx = hash_key(key) % map->capacity; | |
if (!map->bins[idx].key) | |
{ | |
// | |
// This is an empty bin. Initialize it. | |
// The +1 is to account for the null termination character, which isn't | |
// counted by strlen. | |
// | |
map->bins[idx].key = calloc(sizeof(char), strlen(key) + 1); | |
assert(map->bins[idx].key); | |
strcpy(map->bins[idx].key, key); | |
map->bins[idx].value = heap_value; | |
return; | |
} | |
for (current = &map->bins[idx]; current->next; current = current->next) | |
{ | |
// | |
// Overwrite existing value. | |
// | |
if (strcmp(current->key, key) == 0) | |
{ | |
free(current->value); | |
current->value = heap_value; | |
return; | |
} | |
} | |
heap_key = calloc(sizeof(char), strlen(key) + 1); | |
strcpy(heap_key, key); | |
current->next = bin_new(heap_key, heap_value); | |
} | |
int | |
hashmap_retrieve(struct Hashmap *map, char *key, char **value) | |
{ | |
size_t idx = hash_key(key) % map->capacity; | |
struct Bin *current = &map->bins[idx]; | |
if (!current->key) | |
{ | |
// | |
// This is an empty bin. | |
// | |
return 0; | |
} | |
for (; current; current = current->next) | |
{ | |
if (strcmp(current->key, key) == 0) | |
{ | |
*value = current->value; | |
return 1; | |
} | |
} | |
return 0; | |
} | |
size_t | |
hash_key(char *key) | |
{ | |
size_t hash = 0; | |
int i; | |
for (i = 0; i < strlen(key); ++i) | |
hash = (hash << 4) + key[i]; | |
return hash; | |
} | |
int | |
main(void) | |
{ | |
/* | |
* Assign a random integer to each word in the system dictionary. | |
* For a single word ("hello"), print the random integer during assignment. | |
* Finally, fetch the integer for that word from the hashmap. | |
* Since the fetched integer is the same, the map functions correctly. | |
*/ | |
char random[256]; | |
char line[256]; | |
FILE *fin = NULL; | |
char *value = NULL; | |
struct Hashmap map; | |
int ret; | |
hashmap_init(&map, 65535); | |
fin = fopen("/usr/share/dict/words", "r"); | |
assert(fin); | |
for (;;) | |
{ | |
fgets(line, 256, fin); | |
if (feof(fin)) | |
break; | |
sprintf(random, "%d", rand()); | |
if (strcmp("hello\n", line) == 0) | |
printf("in:\tmap[\"hello\"] = \"%s\"\n", random); | |
hashmap_insert(&map, line, random); | |
} | |
fclose(fin); | |
ret = hashmap_retrieve(&map, "hello\n", &value); | |
assert(ret); | |
printf("out:\tmap[\"hello\"] = \"%s\"\n", value); | |
return 0; | |
} |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
CFLAGS=-ggdb -Wall | |
all: hashmap.out | |
# $< the dependencies | |
# $@ the target | |
hashmap.out: hashmap.o | |
gcc -Wall -ggdb < -o @ | |
clean: | |
rm -rf *.out *.o |
No comments:
Post a Comment