Collision Resolution with Nested Hash Tables

Thu, Feb 16, 2017

Collision resolution

Hash collisions in hash tables are unevitable, and therefore every proper implementation needs a form of collision resolution. Collision resolution is the name of the class of algorithms and techniques used to organize and resolve the case where two entries in the table hash to the same bucket.

It turns out that the choice and implementation of collision resolution is absolutely critical for the performance of the table, because while hash tables are often mistaken for having $O(1)$ lookups, they do in reality and theory have a sligthly more complicated behavior.

There are really two big competing families of algorithms of resolving collisions:

Open addressing: This keeps all entries in the bucket array, but applies some way of finding a new bucket if the ideal one is occupied. To keep the load factor low, they generally reallocate before they get full. This is an OK solution when doing things in-memory, but on-disk this absolutely sucks, since you potentially have millions of entries. If you plot the lookup time over the number of entries, it will look like an increasing line, which suddenly peaks and then falls, with the peaks getting more and more uncommon. This rather complex performance behavior can also make them unfit for certain purposes.
Chaining: This uses some structure for storing multiple entries in a bucket. Often through B-trees or linked lists.

In this post, we will look at chaining.

Nested tables

What if we used another hash table for multi-entry buckets?

The idea is that we have some sequence of independent (this is a very important invariant) hash functions $\{h_n(x)\}$ , and the root table uses hash function $h_0(k)$ to assign a bucket to key $k$ . If the entry is empty, the value is simply inserted there and the bucket's tag is set to "single".

If however the bucket already has at least one entry, it will be inserted into a hash table, placed in said bucket, and the bucket where the entry will be (in the new hash table) is determined by $h_1(k)$ .

This process ( $h_0(k)$ , $h_1(k)$ , $h_2(k)$ ...) repeats until a free bucket is found.

This seems like a pretty obvious thing, but when you think about it it has some interesting properties for block-based on-disk structures, as we will discuss later.

Analysis

The analysis is pretty simple: If the hash functions are sufficiently good and independent, the space usage has a (amortized) linear upper-bound.

The lookup speed is perhaps more important than the space, and it is in many ways similar to B+ trees, except that this is simpler and perhaps faster. Both have logarithmic complexity of lookups, and more importantly, the base of the logarithm is usually pretty high (although it depends on the choice of default table size).

Block-based on-disk storage

Because block-based storage is so convenient, it is used in many database systems and file systems. It is one of the reasons B+ trees is such a popular indexing method.

In particular, if a single block can store N pointers, we can simply let one table have N buckets, meaning there is no waste of space.

Compared to B+ trees, there is a pretty clear advantage in the case of dynamically sized keys, since comparing requires loading, which is generally a very expensive task. Hashing on the other hand is easy, and never requires any comparison (with the exception of the last table, perhaps¹).

¹it is popular to use cryptographic fingerprints to avoid the inconvenience of arbitrary-sized keys.