Computer Science on Ticki's blog
http://ticki.github.io/tags/computer-science/index.xml
Recent content in Computer Science on Ticki's blog

Collision Resolution with Nested Hash Tables
http://ticki.github.io/blog/collision-resolution-with-nested-hash-tables/
Thu, 16 Feb 2017 00:00:00 +0000
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<h1 id="collision-resolution">Collision resolution</h1>
<p>Hash collisions in hash tables are inevitable, and therefore every proper implementation needs a form of <em>collision resolution</em>. Collision resolution is the name of the class of algorithms and techniques used to organize and resolve the case where two entries in the table hash to the same bucket.</p>
<p>It turns out that the choice and implementation of collision resolution is absolutely critical for the performance of the table, because while hash tables are often assumed to have <span class="math">\(O(1)\)</span> lookups, their real behavior, in both theory and practice, is slightly more complicated.</p>
<p>There are really two big competing families of algorithms of resolving collisions:</p>
<ol>
<li>Open addressing: This keeps all entries in the bucket array itself, but applies some way of finding a new bucket if the ideal one is occupied. To keep the load factor low, such tables generally reallocate before they get full. This is an OK solution when doing things in-memory, but on-disk it absolutely sucks, since you potentially have millions of entries. If you plot the lookup time over the number of entries, it looks like an increasing line that suddenly peaks and falls back at each reallocation, with the peaks getting more and more uncommon. This rather complex performance behavior can also make open addressing unfit for certain purposes.</li>
<li>Chaining: This stores multiple entries per bucket in some secondary structure, often a linked list or a B-tree.</li>
</ol>
<p>In this post, we will look at chaining.</p>
<h1 id="nested-tables">Nested tables</h1>
<p>What if we used another hash table for multi-entry buckets?</p>
<p>The idea is that we have some sequence of independent (this is a very important invariant) hash functions <span class="math">\(\{h_n(x)\}\)</span>, and the root table uses hash function <span class="math">\(h_0(k)\)</span> to assign a bucket to key <span class="math">\(k\)</span>. If the bucket is empty, the value is simply inserted there and the bucket's tag is set to "single".</p>
<p>If, however, the bucket already holds at least one entry, a new hash table is placed in that bucket, and the bucket where the entry will go within this nested table is determined by <span class="math">\(h_1(k)\)</span>.</p>
<p>This process (<span class="math">\(h_0(k)\)</span>, <span class="math">\(h_1(k)\)</span>, <span class="math">\(h_2(k)\)</span>...) repeats until a free bucket is found.</p>
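<p>As an illustration, this scheme can be sketched in Rust. This is a toy, not a real implementation: the per-table size, the key/value types, and the use of a level-seeded standard hasher to simulate the independent family <span class="math">\(\{h_n(x)\}\)</span> are all assumptions for the example, and duplicate keys are not handled.</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::mem;

const BUCKETS: usize = 8; // tiny table size, for illustration only

// h_n(k): simulate an independent hash function per level by seeding
// the hasher with the level.
fn h(level: usize, key: u64) -> usize {
    let mut s = DefaultHasher::new();
    (level as u64, key).hash(&mut s);
    (s.finish() as usize) % BUCKETS
}

enum Bucket {
    Empty,
    Single(u64, u64),        // a single (key, value) entry
    Table(Box<NestedTable>), // a nested table resolving a collision
}

struct NestedTable {
    level: usize,
    buckets: Vec<Bucket>,
}

impl NestedTable {
    fn new(level: usize) -> Self {
        NestedTable { level, buckets: (0..BUCKETS).map(|_| Bucket::Empty).collect() }
    }

    fn insert(&mut self, key: u64, value: u64) {
        let i = h(self.level, key);
        match mem::replace(&mut self.buckets[i], Bucket::Empty) {
            // Free bucket: store the entry directly ("single").
            Bucket::Empty => self.buckets[i] = Bucket::Single(key, value),
            // Occupied by one entry: push both entries one level down,
            // where they are addressed by the next hash function.
            Bucket::Single(k, v) => {
                let mut sub = NestedTable::new(self.level + 1);
                sub.insert(k, v);
                sub.insert(key, value);
                self.buckets[i] = Bucket::Table(Box::new(sub));
            }
            // Already a nested table: recurse into it.
            Bucket::Table(mut sub) => {
                sub.insert(key, value);
                self.buckets[i] = Bucket::Table(sub);
            }
        }
    }

    fn get(&self, key: u64) -> Option<u64> {
        match &self.buckets[h(self.level, key)] {
            Bucket::Empty => None,
            Bucket::Single(k, v) => if *k == key { Some(*v) } else { None },
            Bucket::Table(sub) => sub.get(key),
        }
    }
}
```

<p>A lookup simply follows <span class="math">\(h_0(k), h_1(k), \ldots\)</span> down the chain of tables until it hits the entry or an empty bucket.</p>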
<p>This seems like a pretty obvious thing, but when you think about it it has some interesting properties for block-based on-disk structures, as we will discuss later.</p>
<h1 id="analysis">Analysis</h1>
<p>The analysis is pretty simple: if the hash functions are sufficiently good and independent, the space usage has an amortized linear upper bound.</p>
<p>The lookup speed is perhaps more important than the space, and it is in many ways similar to B+ trees, except that this is simpler and perhaps faster. Both have logarithmic lookup complexity, and more importantly, the base of the logarithm is usually pretty high: with per-table size <span class="math">\(N\)</span>, a lookup touches on the order of <span class="math">\(\log_N n\)</span> tables, although this depends on the choice of default table size.</p>
<h1 id="blockbased-ondisk-storage">Block-based on-disk storage</h1>
<p>Because block-based storage is so convenient, it is used in many database systems and file systems. It is one of the reasons B+ trees are such a popular indexing method.</p>
<p>In particular, if a single block can store <span class="math">\(N\)</span> pointers, we can simply let each table have <span class="math">\(N\)</span> buckets, meaning there is no waste of space.</p>
<p>Compared to B+ trees, there is a pretty clear advantage in the case of dynamically sized keys, since comparing keys requires loading them, which is generally a very expensive task. Hashing, on the other hand, is easy and never requires any comparison (with the possible exception of the last table¹).</p>
<p>¹ It is popular to use cryptographic fingerprints to avoid the inconvenience of arbitrary-sized keys.</p>
SeaHash: Explained
http://ticki.github.io/blog/seahash-explained/
Thu, 08 Dec 2016 00:00:00 +0000
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<p>So, not so long ago, I designed <a href="https://github.com/ticki/tfs/tree/master/seahash">SeaHash</a>, an alternative hash algorithm with performance better than most (all?) of the existing non-cryptographic hash functions available. I designed it for checksumming in a file system I'm working on, but I quickly found out it was suitable for general hashing too.</p>
<p>It blew up. I got a lot of cool feedback, and yesterday it was picked as <a href="https://this-week-in-rust.org/blog/2016/12/06/this-week-in-rust-159/">crate of the week</a>. It shows that there is some interest in it, so I want to explain the ideas behind it.</p>
<h1 id="hashing-an-introduction">Hashing: an introduction</h1>
<p>The basic idea of hashing is to map data with patterns in it to pseudorandom values. It should be designed such that few collisions happen.</p>
<p>Simply put, the hash function should behave like a pseudorandom function (PRF), producing a seemingly random output stream from a non-random input stream, with the difference that it must accept input of variable length.</p>
<p>Formally, perfect PRFs are defined as follows:</p>
<blockquote>
<p><span class="math">\(f : \{0, 1\}^n \to \{0, 1\}^n\)</span> is a perfect PRF if and only if given a distribution <span class="math">\(d : \{0, 1\}^n \to \left[0,1\right]\)</span>, <span class="math">\(f\)</span> maps inputs following the distribution <span class="math">\(d\)</span> to the uniform distribution.</p>
</blockquote>
<p>Note that there is a major difference between cryptographic and non-cryptographic hash functions. SeaHash is not cryptographic, and that's very important to understand: it doesn't aim to be. It aims to give good hash quality, but not cryptographic guarantees.</p>
<h1 id="constructing-a-hash-function-from-a-prf">Constructing a hash function from a PRF</h1>
<p>There are various ways to construct a variable-length hash function from a PRF. The most famous one is the Merkle–Damgård construction, which we will focus on.</p>
<p>There are multiple ways to do the Merkle–Damgård construction; here we use the wide-pipe variant. It works by having a state, which is combined with one block of the input data at a time; the final state is then the hash value. This combining function is called a "compression function": it takes two blocks of the same length and maps them to one, <span class="math">\(f : \{0, 1\}^n \times \{0, 1\}^n \to \{0, 1\}^n\)</span>.</p>
<p><span class="math">\[h = f(f(f(\ldots, b_0), b_1), b_2)\]</span></p>
<p>It is important that this compression function exhibits pseudorandom behavior, and that's where PRFs come in. For a general-purpose hash function, where we don't care about security, the construction usually looks like this:</p>
<p><span class="math">\[f(a, b) = p(a \oplus b)\]</span></p>
<p><span class="math">\(f\)</span> is of course commutative in its two arguments, but that doesn't matter: the chaining in the Merkle–Damgård construction is sequential, so we don't need non-commutativity from the compression function itself.</p>
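<p>As a minimal sketch of the two formulas above: the compression function <code>f</code> XORs the state with a block and runs the result through a stand-in PRF <code>p</code>. The mixer here is a splitmix64-style finalizer chosen for the example; it is not SeaHash's actual function.</p>

```rust
// Stand-in PRF p: a splitmix64-style finalizer (an assumption for this
// sketch, not SeaHash's actual permutation).
fn p(mut x: u64) -> u64 {
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

// Compression function: f(a, b) = p(a ⊕ b).
fn f(a: u64, b: u64) -> u64 {
    p(a ^ b)
}

// Merkle–Damgård fold: h = f(f(f(iv, b0), b1), b2), and so on.
fn hash_blocks(iv: u64, blocks: &[u64]) -> u64 {
    blocks.iter().fold(iv, |state, &b| f(state, b))
}
```

<p>Even though <code>f(a, b) == f(b, a)</code>, the sequential fold makes the final value depend on the order of the blocks.</p>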
<h1 id="choosing-a-prf">Choosing a PRF</h1>
<p>The <a href="http://www.pcg-random.org/">PCG family of PRFs</a> is my favorite I've seen so far:</p>
<p><span class="math">\[\begin{align*}
x &\gets p \otimes x \\
x &\gets x \oplus ((x \gg 32) \gg (x \gg 60)) \\
x &\gets p \otimes x
\end{align*}\]</span></p>
<p>(<span class="math">\(\otimes\)</span> means multiplication modulo <span class="math">\(2^{64}\)</span>)</p>
<p>The PCG paper goes into depth on why this works. In particular, it is quite uncommon to use these kinds of non-fixed, data-dependent shifts.</p>
<p>This is a bijective function, which means that the output can never have less entropy than the input, a good property to have in a hash function.</p>
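<p>In code, one round of this permutation looks like the following sketch. The multiplier <code>P</code> is an odd 64-bit constant; odd means the modular multiplication is invertible, so the whole round stays bijective. Treat the specific value here as an assumption of the example.</p>

```rust
const P: u64 = 0x6EED_0E9D_A4D9_4A4F; // an odd multiplier (odd ⇒ invertible mod 2^64)

fn diffuse(mut x: u64) -> u64 {
    x = x.wrapping_mul(P);       // x ← p ⊗ x
    x ^= (x >> 32) >> (x >> 60); // x ← x ⊕ ((x ≫ 32) ≫ (x ≫ 60))
    x.wrapping_mul(P)            // x ← p ⊗ x
}
```

<p>Each step is invertible on its own (the top bits that select the shift amount are untouched by the XOR), so the composition is a bijection.</p>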
<h1 id="parallelism">Parallelism</h1>
<p>This construction of course relies on dependencies between the states, rendering it impossible to parallelize.</p>
<p>We need a way to be able to independently calculate multiple block updates. With a single state, this is simply not possible, but fear not, we can add multiple states.</p>
<p>Instruction-level parallelism means that we don't even need to fire up multiple threads (which would be quite expensive), but can simply exploit the CPU's instruction pipeline.</p>
<p><figure><img src="http://ticki.github.io/img/seahash_state_update_diagram.svg" alt="A diagram of the new design."></figure></p>
<p>In the above diagram, you can see a 4-state design, where every state except the first is shifted down. The first state (<span class="math">\(a\)</span>) gets the last one (<span class="math">\(d\)</span>) after being combined with the input block (<span class="math">\(D\)</span>) through our PRF:</p>
<p><span class="math">\[\begin{align*}
a' &= b \\
b' &= c \\
c' &= d \\
d' &= f(a \oplus D) \\
\end{align*}\]</span></p>
<p>It isn't obvious how this design allows parallelism, but it has to do with the fact that you can unroll the loop, such that the shifts aren't needed. In particular, after 4 rounds, everything is back at where it started:</p>
<p><span class="math">\[\begin{align*}
a &\gets f(a \oplus B_1) \\
b &\gets f(b \oplus B_2) \\
c &\gets f(c \oplus B_3) \\
d &\gets f(d \oplus B_4) \\
\end{align*}\]</span></p>
<p>If we take 4 rounds every iteration, we get 4 independent state updates, which can run in parallel.</p>
<p>This is also called an <em>alternating 4-state Merkle–Damgård construction</em>.</p>
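<p>A sketch of the unrolled inner loop, using a multiply–xorshift–multiply round as the PRF. The multiplier and the initialization vectors <code>1..4</code> are placeholders for the example, not SeaHash's actual constants, and the input is assumed to be padded to a multiple of four blocks.</p>

```rust
const P: u64 = 0x6EED_0E9D_A4D9_4A4F; // odd multiplier, an assumed constant

// Stand-in PRF: one round of multiply–xorshift–multiply.
fn diffuse(mut x: u64) -> u64 {
    x = x.wrapping_mul(P);
    x ^= (x >> 32) >> (x >> 60);
    x.wrapping_mul(P)
}

// Alternating 4-state update: four independent lane updates per
// iteration, so the CPU can pipeline the four multiplications.
fn hash4(blocks: &[u64]) -> [u64; 4] {
    assert!(blocks.len() % 4 == 0, "sketch assumes padded input");
    let (mut a, mut b, mut c, mut d) = (1u64, 2u64, 3u64, 4u64);
    for w in blocks.chunks_exact(4) {
        a = diffuse(a ^ w[0]);
        b = diffuse(b ^ w[1]);
        c = diffuse(c ^ w[2]);
        d = diffuse(d ^ w[3]);
    }
    [a, b, c, d]
}
```

<p>Because no lane reads another lane's state inside the loop, the four <code>diffuse</code> calls have no data dependencies between them.</p>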
<h2 id="finalizing-the-four-states">Finalizing the four states</h2>
<p>Naively, we would just XOR the four states (which have different initialization vectors, so the lanes are not interchangeable).</p>
<p>There are some issues: what if the input doesn't divide evenly into our 4-block chunks? Well, the simple solution is of course padding, but that gives us another issue: how do we distinguish between padding and genuine trailing zeros?</p>
<p>We XOR the length with the hash value. Unfortunately, this alone diffuses poorly, since appending another zero would then only affect the value slightly, so we need to run the result through our PRF:</p>
<p><figure><img src="http://ticki.github.io/img/seahash_finalization_diagram.svg" alt="SeaHash finalization"></figure></p>
<p>One concern I've heard is that XOR is commutative, and hence permuting the states wouldn't affect the output. But that's simply not true: each state has a distinct initial value, making each lane hash its blocks differently.</p>
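<p>Following the finalization diagram above, this step can be sketched as follows, again with an assumed multiply–xorshift–multiply round standing in for the real PRF:</p>

```rust
const P: u64 = 0x6EED_0E9D_A4D9_4A4F; // odd multiplier, an assumed constant

// Stand-in PRF for the final diffusion.
fn diffuse(mut x: u64) -> u64 {
    x = x.wrapping_mul(P);
    x ^= (x >> 32) >> (x >> 60);
    x.wrapping_mul(P)
}

// XOR the four lanes together, mix in the total input length so that
// padding is distinguishable from trailing zeros, then diffuse once.
fn finalize(lanes: [u64; 4], total_len: u64) -> u64 {
    diffuse(lanes[0] ^ lanes[1] ^ lanes[2] ^ lanes[3] ^ total_len)
}
```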
<h1 id="putting-it-all-together">Putting it all together</h1>
<p>We can finally put it all together:</p>
<p><a href="http://ticki.github.io/img/seahash_construction_diagram.svg"><figure><img src="http://ticki.github.io/img/seahash_construction_diagram.svg" alt="SeaHash construction"></figure></a></p>
<p>(click to zoom)</p>
<p>You can see the code and benchmarks <a href="https://github.com/ticki/tfs/tree/master/seahash">here</a>.</p>