In programs where there is some kind of global state, you will often find the need for having a key-value map; you could for example imagine keeping some kind of cache of a bunch of entries from database table. Obviously, you'd just use a hash table, easy right?
Not really. Imagine that there is multiple threads. One approach is to wrap it in a mutex to ensure thread safety, but that would kind of miss the point of concurrency: It wouldn't be concurrent, it would just be blocking. So, imagine we want a non-blocking, truly concurrent hash table. How'd we go about that?
Several ways to get there exists, unfortunately they all have their trade-offs. In this blog post, I am going represent the implementation of such table, which I came up with during the development of TFS. It is hopefully a sane way to solve the issue with good trade-offs: It has relatively good consistency guarantees, yet very fast.
To be clear, if you're just looking for a quick way to implement this, this is probably not the post for you. It is somewhat complicated, and you might be fine with a hash table with locks and striping. However, if you are looking for something scalable to an arbitrary number of CPU cores, this is for you.
I assume that you are familiar with hash tables and hash table design, as well as basic terms in concurrent programming.
To be able to actually approach the problem, we have to define a limited API for the hash table. Here's the obvious basics:
insert(key, value)- for, well, inserting into the table.
remove(key)- for removing a entry from the table.
get(key)- for querying a value from the table.
There are many ways the above set can be expanded, such as iteration, bulk operations, etc., but we will stick to those for simplicity.
As any data structure, you will have the structure (how it is organized, that is) and the algorithms manipulating it. It is only natural to start by asking: How should the keys and values be organized?
There is not really a single answer to that question, but there's a few things to keep in mind. First of all, we probably want to avoid reallocation. That thing is messy, and often impossible, to do atomically; you will almost always end up introducing some kind of lock, either explicitly or implicitly, so we must necessarily use a kind of linearly growing structure.
What if we used some kind of tree?
If you've seen my post about nested hash tables, you will know where I'm going, namely that we must organize a tree of tables, such that extending it, simply means placing a new table at a leaf of the tree.
The idea is this: Every table has some fixed number of buckets, each of which can be in one of following states:
- Empty - no entry, key, or value.
- Leaf - there is a key and value.
- Branch - the bucket "branches" to another table.
The structure itself is nothing, but a root table. The operations are defined on any table, and not just the root table, as this allows us to use recursion.
We'll see why in a second.
The biggest obstruction to designing concurrent hash table (or hash tables in general) is of course the collision resolution. When two keys collide (which will happen with almost certainty), there must be a way of resolving this collision, meaning to make them able to coexist in the same map.
In this structure, it is simply done by going from a leaf of the tree to a
branch, containing both key-value pairs (say a bucket is
Leaf(entry) and you
want to insert an entry there called
entry2, then the bucket is changed to
new_table has both entries; if the two keys still
collide in the new table, repeat).
The natural question to ask here is how we hash. Surely, if we simply used the same hash function throughout, a collision would keep happening, even after branches, as the keys would be assigned the same bucket, over and over again. Hence, we must choose a way, wherein we can produce distinct hashes for different tables.
The easiest way to do this is of course using one of these fancy "seeded hash functions", but even that has it's limitations: What if (like most of these functions) their independence isn't mathematically proven? What if there exists a key which would always generate a collision, despite of the seed? In other words, it relies on the correctness and quality of the underlying hash function, which we would like to avoid. In fact, even if we asssumed the correctness, it would still be suboptimal, due to an unnecessary amount of collisions, and on top of that, we'd be forced to recalculate the hash for every table, we traverse. We will come back to another way of producing the hashes in next section.
Searching is pretty obvious from here: We simply index by the hash until we find a leaf or empty bucket, in which case we're done.
Lastly, we have deleting, which is simply a matter of replacing a bucket with "empty".
So, how exactly do we generate the sequence (I say sequence, as in the indexes for each of the tables in traversing to the entry, namely the sequence of buckets to follow/jump)?
Well, we could start by formulating a property of correctness:
Given enough elements of the sequence, it shall be possible to reproduce the hashed value?
Since the sequence is never-ending (and hence has an infinite codomain), this is possible to satisfy, contrary to classical hash functions, with a finite codomain.
Another way of thinking about this is that such function is a function of byte strings to numbers in the interval, \(\left[0;1\right)\). This definition actually allows us to go one step further and define bijectiveness to create the ideal of such function.
Anyway, now that we know what we're looking for, we must find a way to actually find it.
The way I've done it is by having a state, which is altering based on the previous element; and a stack, which contains permuted input.
The idea is like this: First the stack is populated with the input, one-byte-by-one. Each byte XOR'd with the state and then permuted by some table. The generated element is pushed to the stack and is made the state.
When the stack is populated, elements of the sequence can be read by popping from the stack, and then permuting the popped element according to the state (i.e. XOR, then use lookup table). Again, this is set to the state.
So why exactly this thing? Doesn't it seem a bit arbitrary?
Well, there is a reason. First of all, the state is there for bringing in some context, such that the element in the output can't simply be predicted according to the same element in the input.
Secondly, from the stack, we can reconstruct the input sequence. And from the output sequence, we can reconstruct the stack, so the input sequence is reconstructible from the output sequence.
But why not simply have the output sequence be the stack? Well, without permuting it after each pop, entropy would only move upwards, i.e. the output would only depend on the elements before it in the input. The permutation in the pop brings in context of the elements after?
Is this optimal? Probably not, but it's fine, as it isn't exactly critical in
performance, kind of like how you don't optimize command-line parsing in
wget, the real hit comes from the atomics, not the hashing.
Making it concurrent
If you've noticed, so far we've really only constructed our table in terms of single-threaded algorithms.
Maybe it's obvious to the reader how to make it concurrent, maybe not. It depends on your experience with things, but I hope the text that follows will make it sensical to you.
We can represent each bucket in a node (table) as a pointer, which we only modify atomically. The null state represents the empty bucket. This allows us to do certain pretty convenient things.
Namely, it allows us to use CAS (compare-and-swap) to ensure that the entry, we're replacing, is indeed empty. If not, we get the non-matching value, which allows us to handle things case-by-case.
Searching is painfully obvious: Simply read and follow the appropriate buckets, until you get to a leaf, which is your final destination. It is a match, if the key of the leaf matches the key, you're searching for.
As described above, the insert routine starts by effectively CAS-ing the bucket of the key for null pointers (empty buckets), and replacing with the leaf containing the key and value, we're interested in inserting.
If it failed, we do the following:
- If the bucket is a leaf, try to CAS this bucket pointer to a branching table containing the leaf's K/V pair and the pair, we're inserting. If the CAS fails, start over with the updated value (see below).
- If the bucket is a table, simply go to that table and insert it there.
- If the bucket is an empty (this means there've been an ABA condition; another thread has removed the value in the meantime), start over with the updated value.
Removal is slightly more complicated.
First, the bucket is read. If it is a subtable, recurse and remove on that.
If not a table and not empty, then you must CAS for the pointer to the null pointer. If this fails, one must restart the process (ABA).
(Optional:) The ABA problem
I've mentioned the ABA problem a few times so far, and it turns out to be a very important issue of concurrent algorithms, but what does it mean?
In simple terms, it means that some data, we've read has changed in the meantime.
This is why I don't read and then write the modified version. Instead I read \(A\), then find \(f(A)\) and then try to CAS \(A\) to \(f(A)\). It can be a pain, but it's necessary. The reason for this is that the atomic, we read as \(A\), could have been changed before or while we evaluated \(f(A)\), hence meaning that \(f(A)\) would reverse progress another thread has made.
So the question (to which I have already revealed the answer) is, what happens if this CAS fails?
Well, we use a loop. If it fails, we use the new value and repeat our operation.
Doesn't this kind of defeat the point of concurrent algorithms? Well, in a sense, it does make it non-atomic, but it isn't really a spin-lock or anything like that, even though it can seem like that: It doesn't wait for the data to be in a state (like a spin-lock will do), it just repeats if the data failed to be substituted.
It is available
that it won't compile out-of-the-box right now, as it relies on some patches I
made to the
crossbeam library. Eventually, I will replace the dependency with
my own replacement for crossbeam.
- It cannot shrink. Hopefully, this isn't that big of an issue, given that the space is reusable, despite not being able to shrink. It is similar to certain balanced trees, which cannot shrink. In general, the space isn't lost, nor leads to unbounded memory usage.
- It is cache suboptimal. This cannot really be avoided. There will, likely, be more than one cache miss, during traversal. However, the added concurrency often pays that off performance-wise.
- It relies on some type of concurrent memory reclamation (e.g. hazard pointers) in order to free memory of removed entries.
Are these worth it? Well, it depends on what you're doing. In my case, these are all very much worth it, and I haven't had any issue with any of them, but every program is unique and will have its own set of conditions.
Follow me on Twitter or Github.