I've known for a long time that hash tables (aka "hash maps" or "dictionaries") are data structures for storing key/value pairs. A perfect example is an actual dictionary:

var myDictionary = {
  dog: "A furry animal",
  Amazon: "A huge long river",
  fruitcake: "A gift made for giving",
  cat: "A furry animal that hates you"
};

The impressive thing about hash tables is how little time it takes to look up the value associated with any given key. The time it takes to find an item in a list is proportional to the length of the list. Finding an item in a sorted list or tree takes time proportional to the logarithm of the number of items. Finding an entry in a hash table of any size takes constant time! In other words, quite unlike the physical version of a dictionary, looking up words in this dictionary is about as fast when there are a billion word/definition pairs as when there are only ten!
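To make that concrete, here's a rough sketch of my own (not part of the implementation below) comparing a look-up that scans an array of key/value pairs with a look-up in a plain JavaScript object, which is itself backed by a hash table under the hood:

var entries = [
  { key: "dog", value: "A furry animal" },
  { key: "Amazon", value: "A huge long river" }
  // ...imagine a billion more entries here
];

// Scanning: the work grows with the number of entries.
var lookUpByScanning = function(key) {
  for (var i = 0; i < entries.length; i++) {
    if (entries[i].key === key) return entries[i].value;
  }
};

// Hashing: the engine hashes the key, so the look-up takes roughly the same
// time whether myDictionary holds ten pairs or a billion.
var lookUpByHashing = function(key) {
  return myDictionary[key];
};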

How a hash look-up is independent of the table's size

This week, I finally learned how such things are possible in the process of implementing one as part of Catalyst*. The first step is to create a large array, one large enough to hold several times the largest number of items we expect to store in our hash.

var makeHashTable = function(){
  var storage = [];   // one big array to hold everything we store
  var max = 1000;     // number of available slots in storage
};

The crucial part is the second step: create a function that converts any key (such as "dog" or "Amazon") into a number that appears random. This number is called a hash key. The number output by this function must always be smaller than your hash table's max size, since it will be used as an index into the storage array. (Note: hash tables can be re-sized, and those included in many dynamic languages are. It's an "interesting", i.e., difficult, problem.)

var getIndexBelowMaxForKey = function(str, max){
  var hash = 0;
  if (str.length === 0) return hash;
  for (var i = 0; i < str.length; i++) {
    hash = (hash << 5) - hash;        // same as hash * 31
    hash = hash + str.charCodeAt(i);
    hash = hash & hash;               // convert to a 32-bit integer
  }
  return Math.abs(hash % max);        // negative numbers will break stuff!
};

Using the index provided by our hashing function, we insert and retrieve values at the specified locations in our storage array. Since the storage index is generated algorithmically based on the key, the total number of items stored doesn't matter!

var insert = function(key, value) {
  if (typeof key === "undefined") {
    throw new Error("Cannot insert with undefined key!");
  }
  var hashIndex = getIndexBelowMaxForKey(key, max);
  storage[hashIndex] = value;
};

var retrieve = function(key) {
  var hashIndex = getIndexBelowMaxForKey(key, max);
  return storage[hashIndex];
};
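For reference, here's one way the pieces above might fit together inside makeHashTable. This assembly is my own sketch rather than something from the original implementation, and it assumes getIndexBelowMaxForKey is already defined in scope:

var makeHashTable = function() {
  var storage = [];   // the big array of slots
  var max = 1000;     // keys always hash to an index below this

  var insert = function(key, value) {
    if (typeof key === "undefined") {
      throw new Error("Cannot insert with undefined key!");
    }
    var hashIndex = getIndexBelowMaxForKey(key, max);
    storage[hashIndex] = value;
  };

  var retrieve = function(key) {
    var hashIndex = getIndexBelowMaxForKey(key, max);
    return storage[hashIndex];
  };

  // Expose insert and retrieve; storage and max stay private in the closure.
  return { insert: insert, retrieve: retrieve };
};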

This is pretty cool. Since all of the code I've written above is JavaScript, you can even test it out in your browser now. Try getting the index for "dog" in a hash with 10k storage locations:

getIndexBelowMaxForKey("dog", 10000);

It returns location 9644, and every time you run that line of code, you'll get the same index. If you use a larger max hash size, you'll likely get a different index, but it won't take any longer to calculate. Adding or retrieving an entry from a ten-trillion-entry dictionary is lightning quick. Try it!
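And if you paste in the assembled makeHashTable sketch from above, using the whole table might look something like this:

var dictionary = makeHashTable();
dictionary.insert("dog", "A furry animal");
dictionary.insert("Amazon", "A huge long river");
dictionary.retrieve("dog");      // "A furry animal"
dictionary.retrieve("Amazon");   // "A huge long river"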

Collisions

There's still a serious flaw in the hash table implementation I've provided: sometimes more than one key will generate the same hash index. When that happens, the second element you insert will overwrite the first. If your max is small compared to the number of items to be placed in the hash, this can happen frequently. With a properly sized storage array, collisions will be rare, but they still happen.
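To see the problem in an exaggerated form, imagine a table with only one slot (this tiny max is purely for illustration); every key hashes to index 0, so every insert clobbers the last one:

getIndexBelowMaxForKey("dog", 1);   // 0
getIndexBelowMaxForKey("cat", 1);   // 0
// With the naive insert above, inserting "cat" after "dog" would silently
// overwrite "dog"'s definition, since both land at storage[0].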

The solution is to store a linked list at each bucket location instead of storing the element directly. Each node has to contain both the element's key and its value. When retrieving a value, one more step is added to the process. First, go to the location provided by the hashing function, which is storage[indexFromFunction]. Then traverse the linked list stored there, comparing the key being looked up with the key of each node in the list. Once there's a match, return the associated value.
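Here's a minimal sketch of what that might look like. It's my own illustration rather than the post's original code, it uses plain objects as the linked-list nodes, and for brevity a re-inserted key just prepends a new node that shadows the old one instead of updating it in place:

var insertWithChaining = function(key, value) {
  var hashIndex = getIndexBelowMaxForKey(key, max);
  // The new node points at whatever list (if any) already lives in this bucket.
  storage[hashIndex] = { key: key, value: value, next: storage[hashIndex] || null };
};

var retrieveWithChaining = function(key) {
  var hashIndex = getIndexBelowMaxForKey(key, max);
  var node = storage[hashIndex];
  while (node) {
    if (node.key === key) return node.value;   // keys match: this is our element
    node = node.next;                          // otherwise keep walking the list
  }
  return undefined;                            // the key was never inserted
};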

But what if it's Monday?

But what if you insert a hundred items into a size 1000 hash and they all collide!!??

Mondays suck, huh? If that happened, the time efficiency of the hash table would become linear, even slightly worse than just using a linked list! The odds of that are astronomically low, but it's a good reason to keep a relatively large hash size.


* Catalyst renamed itself to Hack Reactor partway through my studies there.
