Hash Functions on Ticki's blog
http://ticki.github.io/tags/hash-functions/index.xml
Recent content in Hash Functions on Ticki's blogHugo -- gohugo.ioen-usA general construction for rolling hash functions
http://ticki.github.io/blog/a-general-construction-for-rolling-hash-functions/
Thu, 02 Mar 2017 00:00:00 +0000http://ticki.github.io/blog/a-general-construction-for-rolling-hash-functions/<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<h1 id="what-is-a-rolling-hash-function">What is a rolling hash function?</h1>
<p>A hash function is a function <span class="math">\(h : S^\times \to F\)</span> with <span class="math">\(S, F\)</span> being some finite sets.</p>
<p>A rolling hash function is really a set of functions <span class="math">\((h, u)\)</span>, where <span class="math">\(u\)</span> allows retroactively updated a symbol</p>
<p><span class="math">\[h(\ldots a \ldots) \mapsto h(\ldots a' \ldots)\]</span></p>
<p>To put it more formally, a rolling hash function has an associated function <span class="math">\(u : F \times S^2 \times \mathbb N \to F\)</span>, satisfying</p>
<p><span class="math">\[u(n, a, a', h(\underbrace{\ldots}_n a \ldots)) = h(\underbrace{\ldots}_n a' \ldots)\]</span></p>
<h2 id="an-example">An example</h2>
<p>One of my favorite examples of a rolling hash function is the Rabin-Karp rolling hash.</p>
<p>Essentially, you pick some prime <span class="math">\(p\)</span> and do following operation (over <span class="math">\(\mathbb Z_n\)</span>):</p>
<p><span class="math">\[h(\{c_n\}) = c_1 p^{k - 1} + c_2 p^{k - 2} + \ldots + c_k\]</span></p>
<p>You might be able to figure out how you can construct <span class="math">\(u\)</span>.</p>
<p><span class="math">\[u(n, x, x', H) = H + (x' - x) p^{k - n}\]</span></p>
<p>So why isn't this a pretty good choice? Well, it's</p>
<ol>
<li>Slow. Doing the exponentation can be quite expensive.</li>
<li>It's relatively poor quality. This can be shown by looking at the behavior of the bits: Multiplication never affects lower bits, so it's avalanche effect is very weak.</li>
</ol>
<p>It has a really nice property though, you can use an arbitrary substring of the input and the substring's hash and replace it in <span class="math">\(O(1)\)</span>, whereas most other rolling hash functions requires <span class="math">\(O(n)\)</span>.</p>
<h1 id="a-generalpurpose-construction">A general-purpose construction</h1>
<p>So, is there a general way we can come up with these?</p>
<p>Well, what if we had some family of permutations, <span class="math">\(\sigma_n : F \to F\)</span>?</p>
<p>Assume our <span class="math">\(F\)</span> is an abelian group with some operation <span class="math">\(+\)</span> (could be addition or XOR or a third option).</p>
<p>Then, construct the hash function</p>
<p><span class="math">\[h(\{x_n\}) = \sum_n \sigma_n(x_n)\]</span></p>
<p>Now, we can easily construct <span class="math">\(u\)</span>:</p>
<p><span class="math">\[u(n, x, x', H) = H - \sigma_n(x_n) + \sigma_n(x'_n)\]</span></p>
<h1 id="xor-special-case">XOR special case</h1>
<p>As programmers, we love XOR, because it is so simple, and even better: Every element is its own inverse, under XOR.</p>
<p>Namely, under XOR, <span class="math">\(u\)</span> would look like</p>
<p><span class="math">\[u(n, x, x', H) = H \oplus \sigma_n(x_n) \oplus \sigma_n(x'_n)\]</span></p>
<h1 id="rabinkarp-as-a-special-case">Rabin-Karp as a special case</h1>
<p>The interesting thing is that we can see Rabin-Karp as a special case, namely the family of permutations,</p>
<p><span class="math">\[\sigma_n(x) \equiv xp^{n} \pmod m\]</span></p>
<p>The reason this is a permutation is because <span class="math">\(p\)</span> is odd, hence <span class="math">\(p^n\)</span> is odd, and must therefore have a multiplicative inverse in <span class="math">\(\mathbb Z/m \mathbb Z\)</span>.</p>
<p>Now, why does <span class="math">\(p\)</span> have to be a prime? Well, Every permutation must be distinct, <span class="math">\(f(x) \equiv p^x \pmod m\)</span> is a permutation itself (which can be shown relatively easily through basic group theory).</p>
<h1 id="statistical-properties-and-qualities">Statistical properties and qualities</h1>
<ul>
<li>Flipping a single bit will change the output, no matter what: If <span class="math">\(x \neq x'\)</span>, <span class="math">\(\sigma(x) \neq \sigma(x')\)</span>, because <span class="math">\(\sigma\)</span> is a permutation.</li>
<li>It has perfect collision property: Pick some <span class="math">\(n\)</span>-bit sequence, <span class="math">\(s\)</span>. The number of <span class="math">\(n\)</span>-bit sequences colliding with <span class="math">\(s\)</span> is independent of the choice of <span class="math">\(s\)</span> (all equivalence class have equal size).</li>
</ul>
<h2 id="reduction-to-the-permutation-family">Reduction to the permutation family</h2>
<p>A lot of properties of the function are directly inherited from the quality of the permutation family. In fact, it can be shown that if the permutation family is a family of random oracles, the function is a perfect PRF.</p>
<p>Similarly, if the permutations are uniformly distributed over some input, the constructed function will be as well.</p>
<p>Almost all of the statistical properties, I can think of, has this kind of reductive property allowing us to prove it on the constructed property.</p>
<h2 id="a-good-family-of-permutations">A good family of permutations</h2>
<p>This is a really hard question. Analyzing a single permutation is easy, but analyzing a family of permutations can be pretty hard. Why? Because you need to show their independence.</p>
<p>If one permutation had some dependence on another, the hash function could have poor quality, even if the permutations are pseudorandom, when studied individually.</p>
<h1 id="parallelization">Parallelization</h1>
<p>I'm the author of <a href="https://ticki.github.io/blog/seahash-explained/">SeaHash</a>, and a big part of the design of SeaHash was to parallelize it.</p>
<p>And I could, pretty well. But with its design, there will always be a limit to this parallelization. In case of SeaHash, this limit is 4 (as there are 4 lanes). However, one could imagine hardware where such parallelization ideally should be say 32.</p>
<p>This construction allows for exactly this, without changing the specification. The function is adaptive: The implementation can choose whatever number of parallel lanes to hash in.</p>
<p>This can be done by simply breaking the input up in <span class="math">\(k\)</span> strings, and hashing each individually, starting with <span class="math">\(n\)</span> being the offset of the string.</p>
<p>This is a fairly nice property, as it also allows combination of threaded parallelization and ILP without any constant overhead. Say I'm hashing 4 TB of data, then I could spawn 4 threads (depending on your hardware) and still exploit the 4 CPU pipelines, while not hurting the performance of hashing only a few bytes.</p>