Algorithms on Ticki's blog
http://ticki.github.io/tags/algorithms/index.xml
Recent content in Algorithms on Ticki's blog (Hugo -- gohugo.io, en-us)

The Eudex Algorithm
http://ticki.github.io/blog/the-eudex-algorithm/
Sun, 11 Dec 2016 00:00:00 +0000
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<p>Half a year ago, I designed <a href="https://github.com/ticki/eudex">Eudex</a> as a modern
replacement for Soundex, which is still widely used today. Eudex supports a
wide range of special cases in European languages, while preserving the spirit
of simplicity that Soundex has.</p>
<p>Both Eudex and Soundex are phonetic algorithms that produce a representation of
the sound of some string. Eudex is fundamentally different from Soundex in that
it is not a phonetic classifier. It is a phonetic locality-sensitive hash,
which means that two similarly-sounding strings are not mapped to the same
value, but instead to values near to each other.</p>
<p>This technically makes it a string similarity index, but one should be
careful with this term, given that it doesn't produce a typing distance, but a
phonetic/pronunciation distance.</p>
<p>What this blog post aims to do is to describe the rationale behind Eudex,
hopefully sparking new ideas and thoughts for the reader.</p>
<h1 id="the-output-and-the-input">The output and the input</h1>
<p>So, the input is any Unicode string in a Latin-family alphabet.</p>
<p>The output is a fixed-width integer (we'll use 64 bits, although in some cases
that is a very narrow width), which has the following characteristic:</p>
<blockquote>
<p>If the string <span class="math">\(a\)</span> sounds similar to a string <span class="math">\(b\)</span>, <span class="math">\(f(a) \oplus f(b)\)</span>
has low Hamming weight.</p>
</blockquote>
<p>In other words, two similarly sounding words map to numbers with only a few
bits flipped, whereas words without similar sounds map to numbers with many
bits flipped.</p>
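<p>As a quick illustration, the property can be checked mechanically. A minimal Python sketch, using made-up hash values (not real Eudex output) to stand in for <span class="math">\(f(a)\)</span> and <span class="math">\(f(b)\)</span>:</p>

```python
def hamming_weight(x: int) -> int:
    """Number of set bits (popcount)."""
    return bin(x).count("1")

# Hypothetical hash values (not real Eudex output): similar-sounding
# strings should XOR to a value with few set bits.
f_a = 0b10100000_00000000_00010100
f_b = 0b10100001_00000000_00010100
distance_hint = hamming_weight(f_a ^ f_b)  # low weight => similar sound
```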
<h1 id="the-algorithm">The algorithm</h1>
<p>The algorithm itself is fairly simple. It outputs an 8-byte array (an unsigned
64-bit integer):</p>
<p><span class="math">\[\underbrace{A}_{\text{First phone}} \quad \underbrace{00}_{\text{Padding}} \quad \underbrace{BBBBB}_{\text{Trailing phones}}\]</span></p>
<p>The crucial point here is that all the characters are mapped through a table
carefully derived by their phonetic classification, to make similar sounding
phones have a low Hamming distance.</p>
<p>If two consecutive phones share all bits but the parity bit (i.e., <span class="math">\(a \gg 1
= b \gg 1\)</span>), the second is skipped. This allows us to "collapse" similar or
equal phones into one, analogous to the collapse stage of Soundex:
similar phones next to each other can often be collapsed into one of the phones
without losing the pronunciation.</p>
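<p>The skip rule can be sketched in a few lines of Python (the phone bytes here are hypothetical, chosen so that the middle two differ only in the parity bit):</p>

```python
def collapse(phones):
    """Skip a phone whose bits equal its predecessor's up to the parity
    bit (a >> 1 == b >> 1), mirroring Soundex's collapse stage."""
    out = []
    for p in phones:
        if out and (out[-1] >> 1) == (p >> 1):
            continue  # same phone up to the discriminant bit: skip it
        out.append(p)
    return out

# The middle two bytes differ only in the parity bit, so one is dropped.
collapsed = collapse([0b00010100, 0b00011101, 0b00011100, 0b00000010])
```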
<h1 id="deriving-the-tables">Deriving the tables</h1>
<p>The tables are what makes it interesting. There are four tables: one for ASCII
letters (not characters, letters) in the first slot ('A'), one for C1 (Latin
Supplement) characters in the first slot, one for ASCII letters in the trailing
phones, and one for the C1 (Latin Supplement) characters for the trailing
phones.</p>
<p>There is a crucial distinction between consonants and vowels in Eudex. The
first phone treats vowels as first-class citizens by making distinctions between
all the properties of vowels. The trailing phones only distinguish
between open and close vowels.</p>
<h2 id="trailing-character-table">Trailing character table</h2>
<p>Let's start with the tables for the trailing characters. Consonants' bytes are
treated such that each bit represents a property of the phone (i.e., its
pronunciation), with the exception of the rightmost bit, which is used for
tagging duplicates (it acts as a discriminant).</p>
<p>Let's look at the classification of IPA consonants:</p>
<p><figure><img src="https://upload.wikimedia.org/wikipedia/en/5/5e/IPA_consonants_2005.png" alt="IPA"></figure></p>
<p>As you may notice, characters often represent more than one phone, and
reasoning about which one a given character in a given context represents can
be very hard. So we have to do our best in fitting each character into the
right phonetic category.</p>
<p>We have to pick the classification intelligently. There are certain groups the
table doesn't contain, one of which turns out to be handy in a classification:
liquid consonants (lateral consonants + rhotics), namely <code>r</code> and <code>l</code>. Under
ideal conditions, these should be put into two distinct bits, but unfortunately
there are only 8 bits in a byte, so we have to limit ourselves.</p>
<p>Now, every good phonetic hasher should be able to segregate important
characters (e.g., hard to misspell, crucial to the pronunciation of the word)
from the rest. Therefore we add a category we call "confident"; it will
occupy the most significant bit. In our category of "confident" characters we
put l, r, x, z, and q, since these are either:</p>
<ol>
<li>Crucial to the sound of the word (and thus easier to hear, and harder to
misspell).</li>
<li>Rare to occur, and thus statistically harder to mistake.</li>
</ol>
<p>So our final trailing consonant table looks like:</p>
<table>
<thead>
<tr>
<th>Position</th>
<th align="right">Modifier</th>
<th>Property</th>
<th align="center">Phones</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td align="right">1</td>
<td>Discriminant</td>
<td align="center">(for tagging duplicates)</td>
</tr>
<tr>
<td>2</td>
<td align="right">2</td>
<td>Nasal</td>
<td align="center">mn</td>
</tr>
<tr>
<td>3</td>
<td align="right">4</td>
<td>Fricative</td>
<td align="center">fvsjxzhct</td>
</tr>
<tr>
<td>4</td>
<td align="right">8</td>
<td>Plosive</td>
<td align="center">pbtdcgqk</td>
</tr>
<tr>
<td>5</td>
<td align="right">16</td>
<td>Dental</td>
<td align="center">tdnzs</td>
</tr>
<tr>
<td>6</td>
<td align="right">32</td>
<td>Liquid</td>
<td align="center">lr</td>
</tr>
<tr>
<td>7</td>
<td align="right">64</td>
<td>Labial</td>
<td align="center">bfpv</td>
</tr>
<tr>
<td>8</td>
<td align="right">128</td>
<td>Confident¹</td>
<td align="center">lrxzq</td>
</tr>
</tbody>
</table>
<p>The more "important" a characteristic is to the phone's sound, the higher
place it has.</p>
<p>We then have to treat the vowels. In particular, we don't care much about vowels
in trailing position, so we will simply divide them into two categories: open
and close. It is worth noting that not all vowels fall into these categories;
we will simply place each in the category it is nearest to, e.g. a,
(e), and o get 0 for "open".</p>
<p>So our final ASCII letter table for the trailing phones looks like:</p>
<pre><code> (for consonants)
+--------- Confident
|+-------- Labial
||+------- Liquid
|||+------ Dental
||||+----- Plosive
|||||+---- Fricative
||||||+--- Nasal
|||||||+-- Discriminant
||||||||
a* 00000000
b 01001000
c 00001100
d 00011000
e* 00000001
f 01000100
g 00001000
h 00000100
i* 00000001
j 00000101
k 00001001
l 10100000
m 00000010
n 00010010
o* 00000000
p 01001001
q 10101000
r 10100001
s 00010100
t 00011101
u* 00000001
v 01000101
w 00000000
x 10000100
y* 00000001
z 10010100
| (for vowels)
+-- Close
</code></pre>
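<p>To make the encoding concrete, here is a sketch (in Python, not the actual implementation) of how a few of the consonant entries above fall out of the property modifiers, with the discriminant bits taken as given from the table:</p>

```python
MOD = {"discriminant": 1, "nasal": 2, "fricative": 4, "plosive": 8,
       "dental": 16, "liquid": 32, "labial": 64, "confident": 128}

def phone_byte(*props):
    """OR together the modifiers of the properties a letter has."""
    b = 0
    for prop in props:
        b |= MOD[prop]
    return b

m = phone_byte("nasal")
s = phone_byte("fricative", "dental")
t = phone_byte("fricative", "plosive", "dental", "discriminant")
l = phone_byte("liquid", "confident")
```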
<p>Now, we extend our table to C1 characters by the same method:</p>
<pre><code> (for consonants)
+--------- Confident
|+-------- Labial
||+------- Liquid
|||+------ Dental
||||+----- Plosive
|||||+---- Fricative
||||||+--- Nasal
|||||||+-- Discriminant
||||||||
ß -----s-1 (use 's' from the table above with the last bit flipped)
à 00000000
á 00000000
â 00000000
ã 00000000
ä 00000000 [æ]
å 00000001 [oː]
æ 00000000 [æ]
ç -----z-1 [t͡ʃ]
è 00000001
é 00000001
ê 00000001
ë 00000001
ì 00000001
í 00000001
î 00000001
ï 00000001
ð 00010101 [ð̠] (represented as a non-plosive T)
ñ 00010111 [nj] (represented as a combination of n and j)
ò 00000000
ó 00000000
ô 00000000
õ 00000000
ö 00000001 [ø]
÷ 11111111 (placeholder)
ø 00000001 [ø]
ù 00000001
ú 00000001
û 00000001
ü 00000001
ý 00000001
þ -----ð-- [ð̠] (represented as a non-plosive T)
ÿ 00000001
| (for vowels)
+-- Close
</code></pre>
<h2 id="first-phone-table">First phone table</h2>
<p>So far we have considered the trailing phones; now we need to look into the
first phone. The first phone needs a table with minimal collisions, since you
hardly ever misspell the first letter of a word. Ideally, the table should be
injective, but due to technical limitations this is not possible.</p>
<p>We will use the first bit to distinguish between vowels and consonants.</p>
<p>Previously we divided vowels into only two classes; we will change that
now, but first: the consonants. To avoid repeating ourselves, we will use a
method for reusing the above tables.</p>
<p>Since the least important property (the discriminant) occupies the rightmost
bit, we will simply shift each byte one place to the right (that is, truncate
the rightmost bit). The least significant bit is then flipped when a duplicate
is encountered. This way we preserve the low Hamming distance, while avoiding
collisions.</p>
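<p>A sketch of this derivation, using three trailing-table bytes from above ('g' and 'k' illustrate a duplicate whose low bit gets flipped):</p>

```python
# Trailing-table bytes from the post; 'g' and 'k' collide after the shift.
trailing = {"g": 0b00001000, "k": 0b00001001, "b": 0b01001000}

first = {}
for letter, byte in trailing.items():
    candidate = byte >> 1       # drop the discriminant bit
    if candidate in first.values():
        candidate ^= 1          # duplicate: flip the least significant bit
    first[letter] = candidate
```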
<p>The vowels are more interesting. We need a way to distinguish between vowels
and their sounds.</p>
<p>Luckily, their classification is quite simple:</p>
<p><figure><img src="https://upload.wikimedia.org/wikipedia/en/5/5a/IPA_vowel_chart_2005.png" alt="IPA"></figure></p>
<p>If a vowel appears as two phones (e.g., depending on the language), we OR them, and
possibly modify the discriminant if it collides with another phone.</p>
<p>We need to divide each of the axes into more than two categories to utilize
all our bits, so some properties will have to occupy multiple bits.</p>
<table>
<thead>
<tr>
<th>Position</th>
<th align="right">Modifier</th>
<th>Property (vowel)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td align="right">1</td>
<td>Discriminant</td>
</tr>
<tr>
<td>2</td>
<td align="right">2</td>
<td>Is it open-mid?</td>
</tr>
<tr>
<td>3</td>
<td align="right">4</td>
<td>Is it central?</td>
</tr>
<tr>
<td>4</td>
<td align="right">8</td>
<td>Is it close-mid?</td>
</tr>
<tr>
<td>5</td>
<td align="right">16</td>
<td>Is it front?</td>
</tr>
<tr>
<td>6</td>
<td align="right">32</td>
<td>Is it close?</td>
</tr>
<tr>
<td>7</td>
<td align="right">64</td>
<td>More close than [ɜ]</td>
</tr>
<tr>
<td>8</td>
<td align="right">128</td>
<td>Vowel?</td>
</tr>
</tbody>
</table>
<p>So we make use of both properties, namely the openness and the "frontness".
Moreover, we allow more than just binary choices:</p>
<pre><code> Class Close Close-mid Open-mid Open
+----------+----------+-----------+---------+
Bits .11..... ...11... ......1. .00.0.0.
</code></pre>
<p>Let's do the same for the other axis:</p>
<pre><code> Class Front Central Back
+----------+----------+----------+
Bits ...1.0.. ...0.1.. ...0.0..
</code></pre>
<p>To combine the properties we OR these tables. Applying this technique, we get:</p>
<pre><code> (for vowels)
+--------- Vowel
|+-------- Closer than ɜ
||+------- Close
|||+------ Front
||||+----- Close-mid
|||||+---- Central
||||||+--- Open-mid
|||||||+-- Discriminant
||||||||
a* 10000100
b 00100100
c 00000110
d 00001100
e* 11011000
f 00100010
g 00000100
h 00000010
i* 11111000
j 00000011
k 00000101
l 01010000
m 00000001
n 00001001
o* 10010100
p 00100101
q 01010100
r 01010001
s 00001010
t 00001110
u* 11100000
v 00100011
w 00000000
x 01000010
y* 11100100
z 01001010
</code></pre>
<p>We then extend it to C1 characters:</p>
<pre><code> +--------- Vowel?
|+-------- Closer than ɜ
||+------- Close
|||+------ Front
||||+----- Close-mid
|||||+---- Central
||||||+--- Open-mid
|||||||+-- Discriminant
||||||||
ß -----s-1 (use 's' from the table above with the last bit flipped)
à -----a-1
á -----a-1
â 10000000
ã 10000110
ä 10100110 [æ]
å 11000010 [oː]
æ 10100111 [æ]
ç 01010100 [t͡ʃ]
è -----e-1
é -----e-1
ê -----e-1
ë 11000110
ì -----i-1
í -----i-1
î -----i-1
ï -----i-1
ð 00001011 [ð̠] (represented as a non-plosive T)
ñ 00001011 [nj] (represented as a combination of n and j)
ò -----o-1
ó -----o-1
ô -----o-1
õ -----o-1
ö 11011100 [ø]
÷ 11111111 (placeholder)
ø 11011101 [ø]
ù -----u-1
ú -----u-1
û -----u-1
ü -----y-1
ý -----y-1
þ -----ð-- [ð̠] (represented as a non-plosive T)
ÿ -----y-1
</code></pre>
<h1 id="distance-operator">Distance operator</h1>
<p>Now that we have our tables, we need the distance operator. A naïve
approach would be to simply use the Hamming distance. This has the disadvantage
that all the bytes have the same weight, which isn't ideal, since you are more
likely to misspell later characters than the first ones.</p>
<p>For this reason, we use weighted Hamming distance:</p>
<table>
<thead>
<tr>
<th align="left">Byte:</th>
<th align="right">1</th>
<th align="right">2</th>
<th align="right">3</th>
<th align="right">4</th>
<th align="right">5</th>
<th align="right">6</th>
<th align="right">7</th>
<th align="right">8</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Weight:</td>
<td align="right">128</td>
<td align="right">64</td>
<td align="right">32</td>
<td align="right">16</td>
<td align="right">8</td>
<td align="right">4</td>
<td align="right">2</td>
<td align="right">1</td>
</tr>
</tbody>
</table>
<p>Namely, we XOR the two values and then add each of the bytes' Hamming weight,
using the coefficients from the table above.</p>
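<p>A sketch of the operator in Python; treating byte 1 as the most significant byte (the first phone, per the layout earlier) is an assumption of this sketch:</p>

```python
WEIGHTS = [128, 64, 32, 16, 8, 4, 2, 1]  # byte 1 (first phone) .. byte 8

def eudex_distance(a: int, b: int) -> int:
    """XOR the hashes, then sum each byte's popcount times its weight."""
    x = a ^ b
    dist = 0
    for i, w in enumerate(WEIGHTS):
        byte = (x >> (8 * (7 - i))) & 0xFF  # byte 1 = most significant byte
        dist += w * bin(byte).count("1")
    return dist
```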
SeaHash: Explained
http://ticki.github.io/blog/seahash-explained/
Thu, 08 Dec 2016 00:00:00 +0000
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<p>So, not so long ago, I designed <a href="https://github.com/ticki/tfs/tree/master/seahash">SeaHash</a>, an alternative hash algorithm with performance better than most (all?) of the existing non-cryptographic hash functions available. I designed it for checksumming in a file system I'm working on, but I quickly found out it was sufficient for general hashing.</p>
<p>It blew up. I got a lot of cool feedback, and yesterday it was picked as <a href="https://this-week-in-rust.org/blog/2016/12/06/this-week-in-rust-159/">crate of the week</a>. It shows that there is some interest in it, so I want to explain the ideas behind it.</p>
<h1 id="hashing-an-introduction">Hashing: an introduction</h1>
<p>The basic idea of hashing is to map data with patterns in it to pseudorandom values. It should be designed such that few collisions occur.</p>
<p>Simply put, the hash function should behave like a pseudorandom function (PRF), producing a seemingly random stream from a non-random stream. The main difference from a PRF is that a hash function must take variable-length input.</p>
<p>Formally, perfect PRFs are defined as follows:</p>
<blockquote>
<p><span class="math">\(f : \{0, 1\}^n \to \{0, 1\}^n\)</span> is a perfect PRF if and only if given a distribution <span class="math">\(d : \{0, 1\}^n \to \left[0,1\right]\)</span>, <span class="math">\(f\)</span> maps inputs following the distribution <span class="math">\(d\)</span> to the uniform distribution.</p>
</blockquote>
<p>Note that there is a major difference between cryptographic and non-cryptographic hash functions. SeaHash is not cryptographic, and that's very important to understand: it doesn't aim to be. It aims to give good hash quality, but not cryptographic guarantees.</p>
<h1 id="constructing-a-hash-function-from-a-prf">Constructing a hash function from a PRF</h1>
<p>There are various ways to construct a variable-length hash function from a PRF. The most famous one is the Merkle–Damgård construction. We will focus on this.</p>
<p>There are multiple ways to do the Merkle–Damgård construction. The most famous one is the wide-pipe construction. It works by having a state, which is combined with one block of the input data at a time. The final state is then the hash value. This combining function is called a "compression function". It takes two blocks of the same length and maps them to one: <span class="math">\(f : \{0, 1\}^n \times \{0, 1\}^n \to \{0, 1\}^n\)</span>.</p>
<p><span class="math">\[h = f(f(f(\ldots, b_0), b_1), b_2)\]</span></p>
<p>It is important that this compression exhibits pseudorandom behavior, and that's where PRFs come in. For a general-purpose hash function where we don't care about security, the construction usually looks like this:</p>
<p><span class="math">\[f(a, b) = p(a \oplus b)\]</span></p>
<p>This of course is commutative, but that doesn't matter, because we don't need non-commutativity in the Merkle–Damgård construction.</p>
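<p>A minimal sketch of this construction; the PRF <span class="math">\(p\)</span> here is a stand-in (an odd-constant multiply plus an XOR-shift), not SeaHash's actual permutation:</p>

```python
MASK = (1 << 64) - 1

def p(x: int) -> int:
    """Stand-in PRF: odd-constant multiply (bijective mod 2^64) + XOR-shift."""
    x = (x * 0x9E3779B97F4A7C15) & MASK
    return x ^ (x >> 31)

def hash_blocks(blocks, iv=0):
    """Wide-pipe fold: h = f(f(f(iv, b0), b1), b2)... with f(a, b) = p(a ^ b)."""
    state = iv
    for b in blocks:
        state = p(state ^ b)
    return state
```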
<h1 id="choosing-a-prf">Choosing a PRF</h1>
<p>The <a href="http://www.pcg-random.org/">PCG family of PRFs</a> is my favorite I've seen so far:</p>
<p><span class="math">\[\begin{align*}
x &\gets p \otimes x \\
x &\gets x \oplus ((x \gg 32) \gg (x \gg 60)) \\
x &\gets p \otimes x
\end{align*}\]</span></p>
<p>(<span class="math">\(\otimes\)</span> means modular multiplication)</p>
<p>The PCG paper goes into depth on why this works well. In particular, it is quite uncommon to use these kinds of non-fixed shifts.</p>
<p>This is a bijective function, which means the output can never have less entropy than the input, which is a good property to have in a hash function.</p>
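<p>Written out for 64-bit integers, the permutation looks like this; the multiplier is the odd constant from SeaHash's source, but treat the exact value as an assumption of this sketch:</p>

```python
MASK = (1 << 64) - 1
P = 0x6eed0e9da4d94a4f  # odd multiplier from SeaHash's source (assumed)

def diffuse(x: int) -> int:
    x = (x * P) & MASK            # modular multiplication
    x ^= (x >> 32) >> (x >> 60)   # non-fixed (data-dependent) shift
    x = (x * P) & MASK
    return x
```

Because each step is a permutation of the 64-bit words, distinct inputs always give distinct outputs.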
<h1 id="parallelism">Parallelism</h1>
<p>This construction of course relies on dependencies between the states, rendering it impossible to parallelize.</p>
<p>We need a way to be able to independently calculate multiple block updates. With a single state, this is simply not possible, but fear not, we can add multiple states.</p>
<p>Instruction-level parallelism means that we don't even need to fire up multiple threads (which would be quite expensive), but can simply exploit the CPU's instruction pipeline.</p>
<p><figure><img src="http://ticki.github.io/img/seahash_state_update_diagram.svg" alt="A diagram of the new design."></figure></p>
<p>In the above diagram, you can see a 4-state design, where every state except the first is shifted down. The first state (<span class="math">\(a\)</span>) gets the last one (<span class="math">\(d\)</span>) after being combined with the input block (<span class="math">\(D\)</span>) through our PRF:</p>
<p><span class="math">\[\begin{align*}
a' &= b \\
b' &= c \\
c' &= d \\
d' &= f(a \oplus D) \\
\end{align*}\]</span></p>
<p>It isn't obvious how this design allows parallelism, but it has to do with the fact that you can unroll the loop, such that the shifts aren't needed. In particular, after 4 rounds, everything is back at where it started:</p>
<p><span class="math">\[\begin{align*}
a &\gets f(a \oplus B_1) \\
b &\gets f(b \oplus B_2) \\
c &\gets f(c \oplus B_3) \\
d &\gets f(d \oplus B_4) \\
\end{align*}\]</span></p>
<p>If we take 4 rounds every iteration, we get 4 independent state updates, which are run in parallel.</p>
<p>This is also called an <em>alternating 4-state Merkle–Damgård construction</em>.</p>
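<p>A sketch of one unrolled round; <code>diffuse</code> is a stand-in PRF (any 64-bit permutation would do for illustration):</p>

```python
MASK = (1 << 64) - 1
P = 0x6eed0e9da4d94a4f  # assumed odd multiplier

def diffuse(x: int) -> int:  # stand-in PRF (a 64-bit permutation)
    x = (x * P) & MASK
    x ^= (x >> 32) >> (x >> 60)
    return (x * P) & MASK

def round4(lanes, blocks):
    """One unrolled round: each lane absorbs one block, independently."""
    a, b, c, d = lanes
    B1, B2, B3, B4 = blocks
    return (diffuse(a ^ B1), diffuse(b ^ B2),
            diffuse(c ^ B3), diffuse(d ^ B4))

lanes = round4((1, 2, 3, 4), (10, 20, 30, 40))
```

The four `diffuse` calls have no data dependencies between them, which is exactly what lets the CPU pipeline them.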
<h2 id="finalizing-the-four-states">Finalizing the four states</h2>
<p>Naively, we would just XOR the four states (which have different initialization vectors, so permuting the lanes still changes the result).</p>
<p>There are some issues: what if the input doesn't divide into our 4 blocks? Well, the simple solution is of course padding, but that gives us another issue: how do we distinguish between padding and normal zeros?</p>
<p>We XOR the length with the hash value. Unfortunately, this alone is not enough, since appending another zero would then only affect the value slightly, so we need to run it through our PRF:</p>
<p><figure><img src="http://ticki.github.io/img/seahash_finalization_diagram.svg" alt="SeaHash finalization"></figure></p>
<p>One concern I've heard is that XOR is commutative, and hence permuting the states wouldn't affect the output. But that's simply not true: each state has a distinct initial value, making each lane hash differently.</p>
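<p>The finalization step can be sketched as follows, with the same stand-in permutation as the PRF:</p>

```python
MASK = (1 << 64) - 1
P = 0x6eed0e9da4d94a4f  # assumed odd multiplier

def diffuse(x: int) -> int:
    x = (x * P) & MASK
    x ^= (x >> 32) >> (x >> 60)
    return (x * P) & MASK

def finalize(lanes, total_len):
    """XOR the lanes together, XOR in the length, then diffuse once."""
    h = 0
    for lane in lanes:
        h ^= lane
    return diffuse(h ^ total_len)
```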
<h1 id="putting-it-all-together">Putting it all together</h1>
<p>We can finally put it all together:</p>
<p><a href="http://ticki.github.io/img/seahash_construction_diagram.svg"><figure><img src="http://ticki.github.io/img/seahash_construction_diagram.svg" alt="SeaHash construction"></figure></a></p>
<p>(click to zoom)</p>
<p>You can see the code and benchmarks <a href="https://github.com/ticki/tfs/tree/master/seahash">here</a>.</p>
Designing a good non-cryptographic hash function
http://ticki.github.io/blog/designing-a-good-non-cryptographic-hash-function/
Fri, 04 Nov 2016 16:28:44 +0200
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<p>So, I've been needing a hash function for various purposes lately. None of the existing hash functions I could find were sufficient for my needs, so I went and designed my own. These are my notes on the design of hash functions.</p>
<h1 id="what-is-a-hash-function-really">What is a hash function <em>really</em>?</h1>
<p>Hash functions are functions which map an infinite domain to a finite codomain. Two elements in the domain, <span class="math">\(a, b\)</span>, are said to collide if <span class="math">\(h(a) = h(b)\)</span>.</p>
<p>The ideal hash function has the property that the distribution of the image of a subset of the domain is statistically independent of the probability of said subset occurring. That is, collisions are not likely to occur even within non-uniformly distributed sets.</p>
<p>Consider you have an English dictionary. Clearly, <code>hello</code> is more likely to be a word than <code>ctyhbnkmaasrt</code>, but the hash function must not be affected by this statistical redundancy.</p>
<p>In a sense, you can think of the ideal hash function as being a function where the output is uniformly distributed (e.g., chosen by a sequence of coinflips) over the codomain no matter what the distribution of the input is.</p>
<p>With a good hash function, it should be hard to distinguish between a truly random sequence and the hashes of some permutation of the domain.</p>
<p>Hash functions ought to be as chaotic as possible. A small change in the input should appear in the output as if it were a big change. This is called the hash function butterfly effect.</p>
<h2 id="noncryptographic-and-cryptographic">Non-cryptographic and cryptographic</h2>
<p>One must make the distinction between cryptographic and non-cryptographic hash functions. In a cryptographic hash function, it must be infeasible to:</p>
<ol>
<li>Generate the input from its hash output.</li>
<li>Generate two inputs with the same output.</li>
</ol>
<p>Non-cryptographic hash functions can be thought of as approximations of these invariants. The reason for using non-cryptographic hash functions is that they're significantly faster than cryptographic hash functions.</p>
<h1 id="diffusions-and-bijection">Diffusions and bijection</h1>
<p>The basic building blocks of good hash functions are diffusions. Diffusions can be thought of as bijective (i.e., every input has one and only one output, and vice versa) hash functions, with the property that input and output are uncorrelated:</p>
<p><figure><img src="http://ticki.github.io/img/bijective_diffusion_diagram.svg" alt="A diagram of a diffusion."></figure></p>
<p>This diffusion function has a relatively small domain, for illustrative purposes.</p>
<h2 id="building-a-good-diffusion">Building a good diffusion</h2>
<p>Diffusions are often built from smaller, bijective components, which we will call "subdiffusions".</p>
<h3 id="types-of-subdiffusions">Types of subdiffusions</h3>
<p>One must distinguish between the different kinds of subdiffusions.</p>
<p>The first class to consider is the <strong>bitwise subdiffusions</strong>. These are quite weak when they stand alone, and thus must be combined with other types of subdiffusions. Bitwise subdiffusions might flip certain bits and/or reorganize them:</p>
<p><span class="math">\[d(x) = \sigma(x) \oplus m\]</span></p>
<p>(we use <span class="math">\(\sigma\)</span> to denote permutation of bits)</p>
<p>The second class is <strong>dependent bitwise subdiffusions</strong>. These are diffusions which permute the bits and XOR the result with the original value:</p>
<p><span class="math">\[d(x) = \sigma(x) \oplus x\]</span></p>
<p>(exercise for the reader: work out for which <span class="math">\(\sigma\)</span> the above subdiffusion is invertible)</p>
<p>Another similar, often-used subdiffusion in the same class is the XOR-shift:</p>
<p><span class="math">\[d(x) = (x \ll m) \oplus x\]</span></p>
<p>(note that <span class="math">\(m\)</span> can be negative, in which case the bitshift becomes a right bitshift)</p>
<p>The next class of subdiffusions is of massive importance: the <strong>linear subdiffusions</strong>, similar to the LCG random number generators:</p>
<p><span class="math">\[d(x) \equiv ax + c \pmod m, \quad \gcd(a, m) = 1\]</span></p>
<p>(<span class="math">\(\gcd\)</span> means "greatest common divisor"; this constraint is necessary for <span class="math">\(a\)</span> to have an inverse in the ring)</p>
<p>The next class is particularly interesting: the <strong>arithmetic subdiffusions</strong>:</p>
<p><span class="math">\[d(x) = x \oplus (x + c)\]</span></p>
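<p>To make the classes concrete, here are sketches of them on 64-bit words, with hypothetical constants throughout (a rotation stands in for the bit permutation <span class="math">\(\sigma\)</span>, and the XOR-shift represents the dependent bitwise class):</p>

```python
MASK = (1 << 64) - 1

def rol(x, k):
    """Rotate left: a simple bijective bit permutation (our sigma)."""
    return ((x << k) | (x >> (64 - k))) & MASK

def bitwise(x, m=0xDEADBEEF):                # d(x) = sigma(x) xor m
    return rol(x, 7) ^ m

def xor_shift(x, m=13):                      # d(x) = (x << m) xor x
    return ((x << m) & MASK) ^ x             # dependent bitwise class

def linear(x, a=0x9E3779B97F4A7C15, c=1):    # d(x) = ax + c mod 2^64
    return (a * x + c) & MASK                # a is odd, so gcd(a, 2^64) = 1

def arithmetic(x, c=0xABCD):                 # d(x) = x xor (x + c)
    return x ^ ((x + c) & MASK)
```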
<h3 id="combining-subdiffusions">Combining subdiffusions</h3>
<p>Subdiffusions by themselves are of quite poor quality. Combining them is what creates a good diffusion function.</p>
<p>Indeed, if you combine enough different subdiffusions, you get a good diffusion function, but there is a catch: the more subdiffusions you combine, the slower it is to compute.</p>
<p>As such, it is important to find a small, diverse set of subdiffusions which together give good quality.</p>
<h3 id="zerosensitivity">Zero-sensitivity</h3>
<p>If your diffusion isn't zero-sensitive (i.e., <span class="math">\(f(0) = 0\)</span>), you should <del>panic</del> come up with something better. In particular, make sure your diffusion contains at least one zero-sensitive subdiffusion as a component.</p>
<h3 id="avalanche-diagrams">Avalanche diagrams</h3>
<p>Avalanche diagrams are the best and quickest way to find out if your diffusion function has good quality.</p>
<p>Essentially, you draw a grid such that the <span class="math">\((x, y)\)</span> cell's color represents the probability that flipping the <span class="math">\(x\)</span>'th bit of the input will result in the <span class="math">\(y\)</span>'th bit being flipped in the output. If <span class="math">\((x, y)\)</span> is very red, then <span class="math">\(d(a')\)</span>, where <span class="math">\(a'\)</span> is <span class="math">\(a\)</span> with the <span class="math">\(x\)</span>'th bit flipped, is very likely to have the <span class="math">\(y\)</span>'th bit flipped relative to <span class="math">\(d(a)\)</span>.</p>
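<p>Computing such a diagram numerically is straightforward; a sketch (the resulting grid can then be rendered with any plotting tool):</p>

```python
import random

def avalanche(d, bits=64, samples=200, seed=0):
    """grid[x][y] = probability that flipping input bit x flips output bit y."""
    rng = random.Random(seed)
    counts = [[0] * bits for _ in range(bits)]
    for _ in range(samples):
        a = rng.getrandbits(bits)
        out = d(a)
        for x in range(bits):
            delta = d(a ^ (1 << x)) ^ out  # output bits that flipped
            for y in range(bits):
                if (delta >> y) & 1:
                    counts[x][y] += 1
    return [[c / samples for c in row] for row in counts]

# The identity function gives a purely diagonal grid (its straight line).
identity_grid = avalanche(lambda x: x, bits=8, samples=20)
```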
<p>Here's an example of the identity function, <span class="math">\(f(x) = x\)</span>:</p>
<p><figure><img src="http://ticki.github.io/img/identity_function_avalanche_diagram.svg" alt="The identity function."></figure></p>
<p>So why is it a straight line?</p>
<p>Well, if you flip the <span class="math">\(n\)</span>'th bit in the input, the only bit flipped in the output is the <span class="math">\(n\)</span>'th bit. That's kind of boring, let's try adding a number:</p>
<p><figure><img src="http://ticki.github.io/img/addition_avalanche_diagram.svg" alt="Adding a big number."></figure></p>
<p>Meh, this is kind of obvious. Let's try multiplying by a prime:</p>
<p><figure><img src="http://ticki.github.io/img/prime_multiplication_avalanche_diagram.svg" alt="Multiplying by a non-even prime is a bijection."></figure></p>
<p>Now, this is quite interesting actually. We call all the black area "blind spots", and you can see here that anything with <span class="math">\(x > y\)</span> is a blind spot. Why is that? Well, if I flip a high bit, it won't affect the lower bits because you can see multiplication as a form of overlay:</p>
<pre><code>100011101000101010101010111
:
111
↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕
100000001000101010101010111
:
111
</code></pre>
<p>Flipping a single bit will only change the integer forward, never backwards, hence it forms this blind spot. So how can we fix this (we don't want this bias)?</p>
<h4 id="designing-a-diffusion-function--by-example">Designing a diffusion function -- by example</h4>
<p>If we throw in (after the prime multiplication) a dependent bitwise-shift subdiffusion, we have</p>
<p><span class="math">\[\begin{align*}
x &\gets x + 1 \\
x &\gets x \oplus (x \gg z) \\
x &\gets px \\
x &\gets x \oplus (x \ll z) \\
\end{align*}\]</span></p>
<p>(note that we have the <span class="math">\(+1\)</span> in order to make it zero-sensitive)</p>
<p>This generates following avalanche diagram</p>
<p><figure><img src="http://ticki.github.io/img/shift_xor_multiply_avalanche_diagram.svg" alt="Shift-XOR then multiply."></figure></p>
<p>What can cause these blind spots? Clearly there is some form of bias. It turns out that this bias mostly originates in the lack of a hybrid arithmetic/bitwise subdiffusion.
Without such a hybrid, the behavior tends to be relatively local, and the operations do not interfere well with each other. So we add one:</p>
<p><span class="math">\[x \gets x + \text{ROL}_k(x)\]</span></p>
<p>At this point, it looks something like</p>
<p><figure><img src="http://ticki.github.io/img/shift_xor_multiply_rotate_avalanche_diagram.svg" alt="Shift-XOR then multiply."></figure></p>
<p>That's good, but we're not quite there yet...</p>
<p>Let's throw in the following bijection:</p>
<p><span class="math">\[x \gets px \oplus (px \gg z)\]</span></p>
<p>And voilà, we now have a perfect bit independence:</p>
<p><figure><img src="http://ticki.github.io/img/perfect_avalanche_diagram.svg" alt="Everything is red!"></figure></p>
<p>So our finalized version of an example diffusion is</p>
<p><span class="math">\[\begin{align*}
x &\gets x + 1 \\
x &\gets x \oplus (x \gg z) \\
x &\gets px \\
x &\gets x \oplus (x \ll z) \\
x &\gets x + \text{ROL}_k(x) \\
x &\gets px \\
x &\gets x \oplus (x \gg z) \\
\end{align*}\]</span></p>
<p>That seems like a pretty lengthy chunk of operations. We will try to boil it down to fewer operations while preserving the quality of this diffusion.</p>
<p>The most obvious thing to remove is the rotation line. But removing it hurts quality:</p>
<p><figure><img src="http://ticki.github.io/img/multiply_up_avalanche_diagram.svg" alt="Here's the avalanche diagram of said line removed."></figure></p>
<p>Where do these blind spots come from? The answer is pretty simple: shifting left moves the entropy upwards, hence the multiplication will never really flip the lower bits. For example, if you flip the sixth bit and trace it down through the operations, you will see that it never flips anything at the other end.</p>
<p>So what do we do? Instead of shifting left, we need to shift right, since multiplication only affects upwards:</p>
<p><span class="math">\[\begin{align*}
x &\gets x + 1 \\
x &\gets x \oplus (x \gg z) \\
x &\gets px \\
x &\gets x \oplus (x \gg z) \\
x &\gets px \\
x &\gets x \oplus (x \gg z) \\
\end{align*}\]</span></p>
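<p>Written out as code, with hypothetical choices for <span class="math">\(z\)</span> and <span class="math">\(p\)</span> (the post leaves both unspecified):</p>

```python
MASK = (1 << 64) - 1
P = 0x6eed0e9da4d94a4f  # hypothetical odd 64-bit multiplier
Z = 32                  # hypothetical shift amount

def diffuse(x: int) -> int:
    x = (x + 1) & MASK  # zero-sensitivity
    x ^= x >> Z
    x = (x * P) & MASK
    x ^= x >> Z
    x = (x * P) & MASK
    x ^= x >> Z
    return x
```

Every step is bijective (addition, right XOR-shift, odd multiplication mod 2^64), so the whole diffusion is a permutation of the 64-bit words.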
<p>And we're back again. This time with two fewer instructions.</p>
<table>
<thead>
<tr>
<th><figure><img src="http://ticki.github.io/img/cakehash_stage_1_avalanche_diagram.svg" alt="Stage 1"></figure></th>
<th><figure><img src="http://ticki.github.io/img/cakehash_stage_2_avalanche_diagram.svg" alt="Stage 2"></figure></th>
</tr>
</thead>
<tbody>
<tr>
<td><p><figure><img src="http://ticki.github.io/img/cakehash_stage_3_avalanche_diagram.svg" alt="Stage 3"></figure></p>
</td>
<td><p><figure><img src="http://ticki.github.io/img/cakehash_stage_4_avalanche_diagram.svg" alt="Stage 4"></figure></p>
</td>
</tr>
<tr>
<td><p><figure><img src="http://ticki.github.io/img/cakehash_stage_5_avalanche_diagram.svg" alt="Stage 5"></figure></p>
</td>
<td><p><figure><img src="http://ticki.github.io/img/perfect_avalanche_diagram.svg" alt="Stage 6"></figure></p>
</td>
</tr>
</tbody>
</table>
<h1 id="combining-diffusions">Combining diffusions</h1>
<p>Diffusions map a finite state space to a finite state space; as such, they are not by themselves sufficient as an arbitrary-length hash function, so we need a way to combine diffusions.</p>
<p>In particular, we can eat <span class="math">\(N\)</span> bytes of the input at once and modify the state based on that:</p>
<p><span class="math">\[s' = d(f(s', x))\]</span></p>
<p>Or in graphic form,</p>
<p><figure><img src="http://ticki.github.io/img/hash_round_flowchart.svg" alt="A flowchart."></figure></p>
<p><span class="math">\(f(s', x)\)</span> is what we call our combinator function. It serves for combining the old state and the new input block (<span class="math">\(x\)</span>). <span class="math">\(d(a)\)</span> is just our diffusion function.</p>
<p>It doesn't matter if the combinator function is commutative or not, but it is crucial that it is not biased, i.e. if <span class="math">\(a, b\)</span> are uniformly distributed variables, <span class="math">\(f(a, b)\)</span> is too. Ideally, there should exist a bijection, <span class="math">\(g(f(a, b), b) = a\)</span>, which implies that it is not biased.</p>
<p>An example of such a combinator function is simple addition:</p>
<p><span class="math">\[f(a, b) = a + b\]</span></p>
<p>Another is</p>
<p><span class="math">\[f(a, b) = a \oplus b\]</span></p>
<p>I'm partial towards saying that these are the only sane choices for combinator functions, and you must pick between them based on the characteristics of your diffusion function:</p>
<ol>
<li>If your diffusion function is primarily based on arithmetics, you should use the XOR combinator function.</li>
<li>If your diffusion function is primarily based on bitwise operations, you should use the additive combinator function.</li>
</ol>
<p>The reason for this is that you want the operations to be as diverse as possible, to create complex, seemingly random behavior.</p>
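<p>Putting the pieces together, a toy arbitrary-length hash might look like the sketch below. Since the diffusion here is multiplication-heavy, we follow rule 1 and use the XOR combinator; the constant is an arbitrary odd number, not one from any published function:</p>

```rust
// A toy multiplication-heavy diffusion; the constant is arbitrary but odd.
fn diffuse(mut x: u64) -> u64 {
    x = x.wrapping_mul(0x6eed_0e9d_a4d9_4a4f);
    x ^= x >> 32;
    x.wrapping_mul(0x6eed_0e9d_a4d9_4a4f)
}

// Eat 8 bytes at a time: s' = d(f(s, x)) with f(a, b) = a XOR b.
fn hash(input: &[u8]) -> u64 {
    let mut state = 0u64;
    for chunk in input.chunks(8) {
        // Read up to 8 bytes little-endian, zero-padded.
        let mut buf = [0u8; 8];
        buf[..chunk.len()].copy_from_slice(chunk);
        state = diffuse(state ^ u64::from_le_bytes(buf));
    }
    // Mix in the length so inputs differing only in trailing zeros differ.
    diffuse(state ^ input.len() as u64)
}
```

<p>Because the diffusion is a bijection and XOR is unbiased, distinct single-block inputs are guaranteed to produce distinct outputs.</p>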
<h1 id="simd-simd-simd">SIMD, SIMD, SIMD</h1>
<p>If you want good performance, you shouldn't read only one byte at a time. By reading multiple bytes at a time, your algorithm becomes several times faster.</p>
<p>This, however, introduces the need for some finalization if the total number of input bytes isn't divisible by the number of bytes read in a round. One possibility is to pad with zeros and write the total length at the end; however, this turns out to be somewhat slow for small inputs.</p>
<p>A better option is to write the number of padding bytes into the last byte.</p>
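<p>As a sketch of this finalization (assuming 8-byte rounds, purely for illustration):</p>

```rust
// Pad a final partial block (fewer than 8 bytes) with zeros and record
// the number of padding bytes in the last byte, as described above.
fn pad_final_block(partial: &[u8]) -> [u8; 8] {
    assert!(partial.len() < 8);
    let mut block = [0u8; 8];
    block[..partial.len()].copy_from_slice(partial);
    // The last byte tells the decoder how many trailing bytes to discard.
    block[7] = (8 - partial.len()) as u8;
    block
}
```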
<h1 id="instruction-level-parallelism">Instruction level parallelism</h1>
<p>Fetching multiple blocks and running a round on each sequentially (with no dependencies until the last) is something I've found to work well. This exploits the so-called instruction pipeline, through which modern processors run independent instructions in parallel.</p>
<h1 id="testing-the-hash-function">Testing the hash function</h1>
<p>Multiple test suites exist for testing the quality and performance of your hash function. <a href="https://github.com/aappleby/smhasher">Smhasher</a> is one of these.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Many relatively simple components can be combined into a strong and robust non-cryptographic hash function for use in hash tables and in checksumming. Deriving such a function is really just a matter of choosing and combining the right components.</p>
<p>Breaking the problem down into small subproblems significantly simplifies analysis and guarantees.</p>
<p>The key to a good hash function is trial and error. Testing and throwing out candidates is the only way you can really find out if your hash function works in practice.</p>
<p>Have fun hacking!</p>
How LZ4 works
http://ticki.github.io/blog/how-lz4-works/
Tue, 25 Oct 2016 23:25:15 +0200http://ticki.github.io/blog/how-lz4-works/<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<p>LZ4 is a really fast compression algorithm with a reasonable compression ratio, but unfortunately there is limited documentation on how it works. The only explanation (not spec, explanation) <a href="https://fastcompression.blogspot.com/2011/05/lz4-explained.html">can be found</a> on the author's blog, but I think it is less of an explanation and more of an informal specification.</p>
<p>This blog post tries to explain it such that anybody (even beginners) can understand and implement it.</p>
<h1 id="linear-smallinteger-code-lsic">Linear small-integer code (LSIC)</h1>
<p>The first part of LZ4 we need to explain is a smart but simple integer encoder. It is very space efficient for 0-255, and then grows linearly, based on the assumption that the integers used with this encoding rarely exceed this limit; as such, it is only used for small integers in the standard.</p>
<p>It is a form of addition code, in which we read a byte. If this byte is the maximal value (255), another byte is read and added to the sum. This process is repeated until a byte below 255 is reached; that byte is added to the sum, and the sequence ends.</p>
<p><figure><img src="http://ticki.github.io/img/lz4_int_encoding_flowchart.svg" alt="We try to fit it into the next cluster."></figure></p>
<p>In short, we just keep adding bytes and stop when we hit a non-0xFF byte.</p>
<p>We'll use the name "LSIC" for convenience.</p>
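<p>A minimal LSIC decoder, following the description above, might look like:</p>

```rust
// Keep adding bytes to the sum while we read 0xFF,
// then add the final non-0xFF byte and stop.
fn read_lsic(input: &mut impl Iterator<Item = u8>) -> u32 {
    let mut sum = 0u32;
    loop {
        let byte = input.next().expect("unexpected end of input");
        sum += byte as u32;
        if byte != 0xFF {
            return sum;
        }
    }
}
```

<p>For example, the byte sequence <code>FF FF 03</code> decodes to 255 + 255 + 3 = 513.</p>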
<h1 id="block">Block</h1>
<p>An LZ4 stream is divided into segments called "blocks". A block contains a literal, which is to be copied directly to the output stream, followed by a back reference, which tells us to copy some number of bytes from the already decompressed stream.</p>
<p>This is really where the compression happens. Copying from the old stream allows deduplication and run-length encoding.</p>
<h2 id="overview">Overview</h2>
<p>A block looks like:</p>
<p><span class="math">\[\overbrace{\underbrace{t_1}_\text{4 bits}\ \underbrace{t_2}_\text{4 bits}}^\text{Token} \quad \underbrace{\overbrace{e_1}^\texttt{LISC}}_\text{If $t_1 = 15$} \quad \underbrace{\overbrace{L}^\text{Literal}}_{t_1 + e_1\text{ bytes }} \quad \overbrace{\underbrace{O}_\text{2 bytes}}^\text{Little endian} \quad \underbrace{\overbrace{e_2}^\texttt{LISC}}_\text{If $t_2 = 15$}\]</span></p>
<p>And decodes to the <span class="math">\(L\)</span> segment, followed by a <span class="math">\(t_2 + e_2 + 4\)</span> bytes sequence copied from position <span class="math">\(l - O\)</span> from the output buffer (where <span class="math">\(l\)</span> is the length of the output buffer).</p>
<p>We will explain all of these in the next sections.</p>
<h2 id="token">Token</h2>
<p>Every block starts with a 1-byte token, which is divided into two 4-bit fields.</p>
<h2 id="literals">Literals</h2>
<p>The first (highest) field in the token is used to define the length of the literal. It obviously takes a value from 0 to 15.</p>
<p>Since we might want to encode a higher integer, we make use of LSIC encoding: if the field is 15 (the maximal value), we read an integer with LSIC and add it to the original value (15) to obtain the literals length.</p>
<p>Call the final value <span class="math">\(L\)</span>.</p>
<p>Then we forward the next <span class="math">\(L\)</span> bytes from the input stream to the output stream.</p>
<p><figure><img src="http://ticki.github.io/img/lz4_literals_copy_diagram.svg" alt="We copy from the buffer directly."></figure></p>
<h2 id="deduplication">Deduplication</h2>
<p>The next few bytes are used to define some segment in the already decoded buffer, which is going to be appended to the output buffer.</p>
<p>This allows us to transmit a position and a length to read from in the already decoded buffer instead of transmitting the literals themselves.</p>
<p>To start with, we read a 16-bit little endian integer. This defines the so called offset, <span class="math">\(O\)</span>. It is important to understand that the offset is not the starting position of the copied buffer. This starting point is calculated by <span class="math">\(l - O\)</span> with <span class="math">\(l\)</span> being the number of bytes already decoded.</p>
<p>Secondly, similarly to the literals length, if <span class="math">\(t_2\)</span> is 15 (the maximal value), we use LSIC to "extend" this value and we add the result. This plus 4 yields the number of bytes we will copy from the output buffer. The reason we add 4 is because copying less than 4 bytes would result in a negative expansion of the compressed buffer.</p>
<p>Now that we know the start position and the length, we can append the segment to the buffer itself:</p>
<p><figure><img src="http://ticki.github.io/img/lz4_deduplicating_diagram.svg" alt="Copying in action."></figure></p>
<p>It is important to understand that the end of the segment might not be initialized before the rest of the segment is appended, because overlaps are allowed. This allows a neat trick, namely "run-length encoding", where you repeat some sequence a given number of times:</p>
<p><figure><img src="http://ticki.github.io/img/lz4_runs_encoding_diagram.svg" alt="We repeat the last byte."></figure></p>
<p>Note that the duplicate section is not required at the end of the stream, i.e. if there are no more compressed bytes to read.</p>
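<p>The copy itself can be sketched as below. Copying byte-by-byte is what makes the run-length trick work, since the source range may include bytes we only just wrote:</p>

```rust
// Append `match_len` bytes starting at `output.len() - offset` to the
// output buffer. The source range may overlap the destination, so we
// must copy byte-by-byte rather than with a bulk memcpy.
fn copy_from_history(output: &mut Vec<u8>, offset: usize, match_len: usize) {
    let start = output.len() - offset;
    for i in 0..match_len {
        let byte = output[start + i];
        output.push(byte);
    }
}
```

<p>For instance, with an offset of 1 the last byte is repeated <code>match_len</code> times, which is exactly run-length encoding.</p>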
<h1 id="compression">Compression</h1>
<p>Until now, we have only considered decoding, not the reverse process.</p>
<p>A dozen approaches to compression exist. What they have in common is that they need to be able to find duplicates in the already-read input buffer.</p>
<p>In general, there are two classes of such compression algorithms:</p>
<ol>
<li>HC: High-compression-ratio algorithms. These are often very complex, and might include steps like backtracking, removing repetitions, and non-greedy matching.</li>
<li>FC: Fast compression. These are simpler and faster, but provide a slightly worse compression ratio.</li>
</ol>
<p>We will focus on the FC-class algorithms.</p>
<p>Binary Search Trees (often B-trees) are often used for searching for duplicates. In particular, for every byte iterated over, we add a pointer to the rest of the buffer to a B-tree, which we call the "duplicate tree". Now, B-trees allow us to retrieve the largest element smaller than or equal to some key. In lexicographic ordering, this is equivalent to asking for the element sharing the largest number of bytes as a prefix.</p>
<p>For example, consider the table:</p>
<pre><code>abcdddd => 0
bcdddd => 1
cdddd => 2
dddd => 3
</code></pre>
<p>If we search for <code>cddde</code>, we'll get a partial match, namely <code>cdddd => 2</code>. So we can quickly find out how many bytes they have in common as a prefix. In this case, it is 4 bytes.</p>
<p>What if we found no match or a bad match (a match that shares less than some threshold)? Well, then we write it as literal until a good match is found.</p>
<p>As you may notice, the dictionary grows linearly. As such, it is important that you reduce memory once in a while, by trimming it. Note that just trimming the first (or last) <span class="math">\(N\)</span> entries is inefficient, because some might be used often. Instead, a <a href="https://en.wikipedia.org/wiki/Cache_Replacement_Policies">cache replacement policy</a> should be used. If the dictionary is filled, the cache replacement policy should determine which match should be replaced. I've found PLRU a good choice of CRP for LZ4 compression.</p>
<p>Note that you should add additional rules, like the match being addressable (within <span class="math">\(2^{16} + 4\)</span> bytes of the cursor, which is required because <span class="math">\(O\)</span> is 16-bit) and being above some length (shorter matches have a worse block-level compression ratio).</p>
<p>Another faster but worse (compression-wise) approach is hashing every four bytes and placing them in a table. This means that you can only look up the latest sequence with a given 4-byte prefix. After a lookup, you can progress and see how far the duplicate sequence matches. When you can't go any further, you encode a literals section until another duplicate 4-byte prefix is found.</p>
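<p>A sketch of this table-based approach is given below; the table size and the multiplicative hash constant are arbitrary choices, not values from the reference implementation:</p>

```rust
// Remember the most recent position of every 4-byte window in a small
// table, and verify candidates on lookup, since hashes can collide.
const TABLE_SIZE: usize = 1 << 12;

fn hash4(window: [u8; 4]) -> usize {
    // An arbitrary multiplicative hash folded down to 12 bits.
    (u32::from_le_bytes(window).wrapping_mul(2654435761) >> 20) as usize % TABLE_SIZE
}

fn find_match(table: &mut [Option<usize>], input: &[u8], pos: usize) -> Option<usize> {
    if pos + 4 > input.len() {
        return None;
    }
    let window = [input[pos], input[pos + 1], input[pos + 2], input[pos + 3]];
    let slot = hash4(window);
    let candidate = table[slot];
    // Overwrite the slot so later lookups find the latest occurrence.
    table[slot] = Some(pos);
    // Only report positions whose 4 bytes really match.
    candidate.filter(|&c| input[c..c + 4] == input[pos..pos + 4])
}
```

<p>The trade-off is clear: only one candidate per 4-byte prefix is remembered, so older (possibly longer) matches are lost, but both lookup and insertion are constant-time.</p>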
<h1 id="conclusion">Conclusion</h1>
<p>LZ4 is a reasonably simple algorithm with reasonably good compression ratio. It is the type of algorithm that you can implement on an afternoon without much complication.</p>
<p>If you need a portable and efficient compression algorithm which can be implemented in only a few hundred lines, LZ4 would be my go-to.</p>
On Random-Access Compression
http://ticki.github.io/blog/on-random-access-compression/
Sun, 23 Oct 2016 23:25:15 +0200http://ticki.github.io/blog/on-random-access-compression/<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<p>This post contains an algorithm I came up with for doing efficient rolling compression. It's going to be used in <a href="https://github.com/ticki/tfs">TFS</a>.</p>
<h1 id="what-is-rolling-compression">What is rolling compression?</h1>
<p>Consider that you have a large file and you want to compress it. That's easy enough, and many algorithms exist for doing so. Now, consider that you want to read or write a small part of the file.</p>
<p>Most algorithms would require you to decompress, write, and recompress the whole file. Clearly, this gets expensive when the file is big.</p>
<h1 id="clusterbased-compression">Cluster-based compression</h1>
<p>A cluster is some small fixed-size block (often 512, 1024, or 4096 bytes). We can have a basic cluster allocator by linking unused clusters together. Cluster-centric compression is interesting, because it can exploit the allocator.</p>
<p>So, the outline is that we compress every <span class="math">\(n\)</span> adjacent clusters to some <span class="math">\(n' < n\)</span> clusters, then free the excess clusters in this compressed line.</p>
<h1 id="copyonwrite">Copy-on-write</h1>
<p>Our compressed clusters are not writable in place, but they can be written by allocating, copying, and deallocating. This is called copy-on-write, or COW for short. It is a common technique used in many file systems.</p>
<p>Essentially, we never write a cluster. Instead, we allocate a new cluster, and copy the data to it. Then we deallocate the old cluster.</p>
<p>This allows us to approach everything much more functionally, and we thus don't have to worry about making compressible blocks incompressible (consider overwriting a highly compressible cluster with random data: a physical cluster containing many virtual clusters would be extended, and those could no longer fit in one cluster).</p>
<h1 id="physical-and-virtual-clusters">Physical and virtual clusters</h1>
<p>Our goal is really to fit multiple clusters into one physical cluster. Therefore, it is essential to distinguish between physical (the stored) and virtual (the compressed) clusters.</p>
<p>A physical cluster can contain up to 8 virtual clusters. A pointer to a virtual cluster starts with 3 bits defining the index into the physical cluster, which is defined by the rest of the pointer.</p>
<p>The allocated physical cluster contains 8 bitflags, defining which of the 8 virtual clusters in the physical cluster are used. This allows us to know how many virtual clusters we need to go over before we get the target decompressed cluster.</p>
<p>When this byte hits zero (i.e. all the virtual clusters are freed), the physical cluster is freed.</p>
<p>Since an active cluster will never have the state zero, we can use this otherwise-unused state to represent an uncompressed physical cluster. This means we have at most one byte of space overhead for incompressible clusters.</p>
<p><figure><img src="http://ticki.github.io/img/virtual_physical_random_access_compression_diagram.svg" alt="A diagram"></figure></p>
<h1 id="the-physical-cluster-allocator">The physical cluster allocator</h1>
<p>The cluster allocator is nothing but a linked list of clusters. Every free cluster links to another free cluster or NIL (no more free clusters).</p>
<p>This method is called SLOB (Simple List Of Objects) and has the advantage of being completely zero-cost, in that there is no wasted space.</p>
<p><figure><img src="http://ticki.github.io/img/slob_allocation_diagram.svg" alt="Physical allocation is simply linked list of free objects."></figure></p>
<h1 id="the-virtual-cluster-allocator">The virtual cluster allocator</h1>
<p>Now we hit the meat of the matter.</p>
<p>When a virtual cluster is allocated, we read from the physical cluster list. The first thing we check is whether we can fit our virtual cluster into the cluster next to the head of the list (wrapping around if we reach the end).</p>
<p>If we can fit it in <em>and</em> we have less than 8 virtual clusters in this physical cluster, we will put it into the compressed physical cluster at the first free virtual slot (and then set the respective bitflag):</p>
<p><figure><img src="http://ticki.github.io/img/allocating_compressed_virtual_page_into_next_diagram.svg" alt="We try to fit it into the next cluster."></figure></p>
<p>If we cannot, we pop the list and use the fully free physical cluster to establish a new stack of virtual clusters. It starts out uncompressed:</p>
<p><figure><img src="http://ticki.github.io/img/pop_and_create_new_uncompressed_cluster_diagram.svg" alt="We pop the list and put the virtual cluster in the physical uncompressed slot."></figure></p>
<h1 id="properties-of-this-approach">Properties of this approach</h1>
<p>This approach to writable random-access compression has some very nice properties.</p>
<h2 id="compression-miss">Compression miss</h2>
<p>We call it a compression miss when we need to pop from the freelist (i.e. we cannot fit the cluster in next to the head). When you allocate, you can have at most one compression miss, and therefore allocation is constant-time.</p>
<h2 id="every-cluster-has-a-sister-cluster">Every cluster has a sister cluster</h2>
<p>Because the "next cluster or wrap" function is bijective, we're sure that we try to insert a virtual cluster to every cluster at least once. This wouldn't be true if we used a hash function or something else.</p>
<p>This has the interesting consequence that filled clusters won't repeatedly be tried for allocation.</p>
<h1 id="limitations">Limitations</h1>
<p>This algorithm has a number of limitations. The first and most obvious one is the limit on the compression ratio. This is a minor one: it caps the ratio at slightly less than 1:8.</p>
<p>A more important limitation is fragmentation. If I allocate many clusters and then deallocate some of them such that many adjacent physical clusters only contain one virtual cluster, this row will have a compression ratio of 1:1 until they're deallocated. Note that it is very rare that this happens, and will only marginally affect the global compression ratio.</p>
<h1 id="update-an-idea">Update: An idea</h1>
<p>A simple trick can improve performance in some cases. Instead of compressing all the virtual clusters in a physical cluster together, you can compress each virtual cluster separately and place them sequentially (with some delimiter) in the physical cluster.</p>
<p>If your compression algorithm is streaming, you can iterate to the right delimiter much faster, and then decompress only that virtual cluster.</p>
<p>This has the downside of making the compression ratio worse. One solution is to have an initial dictionary (if using a dictionary-based compression algorithm).</p>
<p>Another idea is to eliminate the cluster state and replace it by repeated delimiters. I need to investigate this some more with benchmarks and so on in order to tell if this is actually superior to having a centralized cluster state.</p>
Skip Lists: Done Right
http://ticki.github.io/blog/skip-lists-done-right/
Sat, 17 Sep 2016 13:46:49 +0200http://ticki.github.io/blog/skip-lists-done-right/
<p><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.6.0/katex.min.css"></p>
<h1 id="what-is-a-skip-list">What is a skip list?</h1>
<p>In short, skip lists are a linked-list-like structure which allows for fast search. It consists of a base list holding the elements, together with a tower of lists maintaining a linked hierarchy of subsequences, each skipping over fewer elements.</p>
<p>The skip list is a wonderful data structure, one of my personal favorites, but a trend in the past ten years has made them more and more uncommon as single-threaded in-memory structures.</p>
<p>My take is that this is because of how hard they are to get right. The simplicity can easily fool you into being too relaxed with respect to performance, and while they are simple, it is important to pay attention to the details.</p>
<p>In the past five years, people have become increasingly sceptical of skip lists’ performance, due to their poor cache behavior when compared to e.g. B-trees, but fear not, a good implementation of skip lists can easily outperform B-trees while being implementable in only a couple of hundred lines.</p>
<p>How? We will walk through a variety of techniques that can be used to achieve this speed-up.</p>
<p>These are my thoughts on what bad and good implementations of skip lists look like.</p>
<h2 id="advantages">Advantages</h2>
<ul>
<li>Skip lists perform very well on rapid insertions because there are no rotations or reallocations.</li>
<li>They’re simpler to implement than both self-balancing binary search trees and hash tables.</li>
<li>You can retrieve the next element in constant time (compared to logarithmic time for inorder traversal in BSTs and linear time in hash tables).</li>
<li>The algorithms can easily be modified to a more specialized structure (like segment or range “trees”, indexable skip lists, or keyed priority queues).</li>
<li>Making it lockless is simple.</li>
<li>It does well in persistent (slow) storage (often even better than AVL and EH).</li>
</ul>
<h1 id="a-naïve-but-common-implementation">A naïve (but common) implementation</h1>
<p><img src="https://i.imgur.com/nNjOtfa.png" alt="Each shortcut has its own node." /></p>
<p>Our skip list consists of (in this case, three) lists, stacked such that the <b style="font: 400 1.21em KaTeX_Math">n</b>‘th list visits a subset of the nodes the <b style="font: 400 1.21em KaTeX_Math">n - 1</b>‘th list does. This subset is defined by a probability distribution, which we will get back to later.</p>
<p>If you rotate the skip list and remove duplicate edges, you can see how it resembles a binary search tree:</p>
<p><img src="https://i.imgur.com/DO031ek.png" alt="A binary search tree." /></p>
<p>Say I wanted to look up the node “30”, then I’d perform normal binary search from the root and down. Due to duplicate nodes, we use the rule of going right if both children are equal:</p>
<p><img src="https://i.imgur.com/H5KjvqC.png" alt="Searching the tree." /></p>
<p>Self-balancing Binary Search Trees often have complex algorithms to keep the tree balanced, but skip lists have it easier: while similar to trees in some ways, they aren’t trees at all.</p>
<p>Every node in the skip list is given a “height”, defined by the highest level containing the node (equivalently, the number of descendants of the leaf containing the same value). As an example, in the above diagram, “42” has height 2, “25” has height 3, and “11” has height 1.</p>
<p>When we insert, we assign the node a height, following the probability distribution:</p>
<p><center style="font: 400 1.21em KaTeX_Math;font-style: italic;"> p(n) = 2<sup>1-n</sup> </center></p>
<p>To obtain this distribution, we flip a coin until it hits tails, and count the flips:</p>
<pre><code>uint generate_level() {
uint n = 0;
    while (coin_flip()) {
n++;
}
return n;
}
</code></pre>
<p>By this distribution, each layer statistically contains half as many nodes as the one below it, so searching takes expected <b style="font: 400 1.21em KaTeX_Main">O(log <i>n</i>) </b> time.</p>
<p>Note that we only have pointers to the node to the right and the node below, so insertion must be done while searching; that is, instead of searching and then inserting, we insert whenever we go a level down (pseudocode):</p>
<pre><code>-- Recursive skip list insertion function.
define insert(elem, root, height, level):
if right of root < elem:
        -- If the right node isn't "overshot" (i.e. we aren't going too far), we go right.
return insert(elem, right of root, height, level)
else:
if level = 0:
-- We're at bottom level and the right node is overshot, hence
-- we've reached our goal, so we insert the node inbetween root
-- and the node next to root.
old ← right of root
right of root ← elem
right of elem ← old
else:
if level ≤ height:
-- Our level is below the height, hence we need to insert a
-- link before we go on.
old ← right of root
right of root ← elem
right of elem ← old
-- Go a level down.
return insert(elem, below root, height, level - 1)
</code></pre>
<p>The above algorithm is recursive, but we can with relative ease turn it into an iterative form (or let tail-call optimization do the job for us).</p>
<p>As an example, here’s a diagram; the curved lines mark overshoots/edges where a new node is inserted:</p>
<p><img src="https://i.imgur.com/jr9V8Ot.png" alt="An example" /></p>
<h1 id="waste-waste-everywhere">Waste, waste everywhere</h1>
<p>That seems fine doesn’t it? No, not at all. It’s absolute garbage.</p>
<p>There is a total and complete waste of space going on. Let’s assume there are <b style="font: 400 1.21em KaTeX_Math">n</b> elements; then the tallest node has height approximately <b style="font: 400 1.21em KaTeX_Main"><i>h = </i>log<sub>2</sub> <i>n</i></b>, which gives us approximately <b style="font: 400 1.21em KaTeX_Main">1 + Σ<sub><i>k ←0..h</i></sub> <i></i>2<sup><i>-k</i></sup> n ≈ 2<i>n</i></b> nodes in total.</p>
<p><b style="font: 400 1.21em KaTeX_Math">2<i>n</i></b> is certainly no small amount, especially if you consider what each node contains, a pointer to the inner data, the node right and down, giving 5 pointers in total, so a single structure of <b style="font: 400 1.21em KaTeX_Math"><i>n</i></b> nodes consists of approximately <b style="font: 400 1.21em KaTeX_Math">6<i>n</i></b> pointers.</p>
<p>But memory isn’t even the main concern! You need to follow a pointer on every level decrease (approximately 50% of all the links followed), each possibly leading to a cache miss. It turns out that there is a really simple fix for this:</p>
<p>Instead of linking vertically, a good implementation should consist of a singly linked list, in which each node contains an array (representing the nodes above) with pointers to later nodes:</p>
<p><img src="https://i.imgur.com/Fd6gDLv.png" alt="A better skip list." /></p>
<p>If you represent the links (“shortcuts”) through dynamic arrays, you will still often get cache misses. In particular, you might get a cache miss on both the node itself (which is not data-local) and/or the dynamic array. As such, I recommend using a fixed-size array (beware of the two downsides: 1. more space usage, 2. a hard limit on the highest level, with the implication of a linear upper bound when <i style="font: 400 1.21em KaTeX_Math">h > c</i>; furthermore, you should keep it small enough to fit a cache line).</p>
<p>Searching is done by following the top shortcuts as long as you don’t overshoot your target, then you decrement the level and repeat, until you reach the lowest level and overshoot. Here’s an example of searching for “22”:</p>
<p><img src="https://i.imgur.com/cQsPnGa.png" alt="Searching for "22"." /></p>
<p>In pseudocode:</p>
<pre><code>define search(skip_list, needle):
-- Initialize to the first node at the highest level.
level ← max_level
current_node ← root of skip_list
loop:
-- Go right until we overshoot.
while level'th shortcut of current_node < needle:
current_node ← level'th shortcut of current_node
if level = 0:
-- We hit our target.
return current_node
else:
-- Decrement the level.
level ← level - 1
</code></pre>
<h1 id="b-style-font-400-1-21em-katex-math-o-1-b-level-generation"><b style="font: 400 1.21em KaTeX_Math">O(1)</b> level generation</h1>
<p>Even William Pugh made this mistake in <a href="http://epaperpress.com/sortsearch/download/skiplist.pdf">his original paper</a>. The problem lies in the way the level is generated: repeated coin flips (calling the random number generator and checking parity) can mean a couple of RNG state updates (approximately 2 on every insertion). If your RNG is a slow one (e.g. you need high security against DoS attacks), this is noticeable.</p>
<p>The output of the RNG is uniformly distributed, so you need to apply some function which can transform this into the desired distribution. My favorite is this one:</p>
<pre><code>define generate_level():
-- First we apply a mask which makes sure that we don't get a level
-- above our desired level. Then we find the first unset (zero) bit.
ffz(random() & ((1 << max_level) - 1))
</code></pre>
<p>This of course implies that your <code>max_level</code> is no higher than the bit width of the <code>random()</code> output. In practice, most RNGs return 32-bit or 64-bit integers, which means this shouldn’t be a problem unless you have more elements than can fit in your address space.</p>
<h1 id="improving-cache-efficiency">Improving cache efficiency</h1>
<p>A couple of techniques can be used to improve the cache efficiency:</p>
<h2 id="memory-pools">Memory pools</h2>
<p><img src="https://i.imgur.com/Wa8IVBJ.png" alt="A skip list in a memory pool." /></p>
<p>Our nodes are simply fixed-size blocks, so we can keep them data local, with high allocation/deallocation performance, through linked memory pools (SLOBs), which is basically just a list of free objects.</p>
<p>The order doesn’t matter. Indeed, if we swap “9” and “7”, we can suddenly see that this is simply a skip list:</p>
<p><img src="https://i.imgur.com/O863RR1.png" alt="It's true." /></p>
<p>We can keep these together in some arbitrary number of (not necessarily consecutive) pages, drastically reducing cache misses, when the nodes are of smaller size.</p>
<p>Since these are pointers into memory, and not indexes in an array, we need not reallocate on growth. We can simply extend the free list.</p>
<h2 id="flat-arrays">Flat arrays</h2>
<p>If we are interested in compactness and have an insertion/removal ratio near 1, a variant of linked memory pools can be used: we can store the skip list in a flat array, such that we have indexes into said array instead of pointers.</p>
<h2 id="unrolled-lists">Unrolled lists</h2>
<p>Unrolled lists means that instead of linking each element, you link some number of fixed-size chunks, each containing two or more elements (often the chunk is around 64 bytes, i.e. the normal cache line size).</p>
<p>Unrolling is essential for a good cache performance. Depending on the size of the objects you store, unrolling can reduce cache misses when following links while searching by 50-80%.</p>
<p>Here’s an example of an unrolled skip list:</p>
<p><img src="https://i.imgur.com/FYpPQPh.png" alt="A simple 4 layer unrolled skip list." /></p>
<p>The gray box marks excessive space in the chunk, i.e. where new elements can be placed. Searching is done over the skip list, and when a candidate is found, the chunk is searched through <strong>linear</strong> search. To insert, you push to the chunk (i.e. replace the first free space). If no excessive space is available, the insertion happens in the skip list itself.</p>
<p>Note that these algorithms require information about how we found the chunk. Hence we store a “back look”: an array of the last node visited on each level. We can then backtrack if we couldn’t fit the element into the chunk.</p>
<p>We effectively reduce cache misses by some factor depending on the size of the objects you store. This is due to fewer links needing to be followed before the goal is reached.</p>
<h1 id="self-balancing-skip-lists">Self-balancing skip lists</h1>
<p>Various techniques can be used to improve the height generation, to give a better distribution. In other words, we make the level generator aware of our nodes, instead of purely random, independent RNGs.</p>
<h2 id="self-correcting-skip-list">Self-correcting skip list</h2>
<p>The simplest way to achieve a content-aware level generator is to keep track of the number of nodes of each level in the skip list. If we assume there are <b style="font: 400 1.21em KaTeX_Math"><i>n</i></b> nodes, the expected number of nodes with level <b style="font: 400 1.21em KaTeX_Math"><i>l</i></b> is <b style="font: 400 1.21em KaTeX_Main">2<sup><i>-l</i></sup><i>n</i></b>. Subtracting this from the actual number gives us a measure of how well-balanced each height is:</p>
<p><img src="http://i.imgur.com/bBf7kcg.png" alt="Balance" /></p>
<p>When we generate a new node’s level, we choose one of the heights with the biggest under-representation (see the black line in the diagram), either randomly or by some fixed rule (e.g. the highest or the lowest).</p>
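<p>A sketch of such a generator is given below, indexing levels from 0 for simplicity and breaking ties towards the lowest level (an arbitrary choice):</p>

```rust
// Pick the level whose actual node count falls furthest below its
// expected count of 2^(-l) * n, where counts[l] is the number of
// nodes of level l and n is the total number of nodes.
fn balanced_level(counts: &[usize]) -> usize {
    let n: usize = counts.iter().sum();
    let mut best_level = 0;
    let mut best_deficit = f64::NEG_INFINITY;
    for (level, &count) in counts.iter().enumerate() {
        let expected = n as f64 / 2f64.powi(level as i32);
        let deficit = expected - count as f64;
        // Strict comparison means ties resolve to the lowest level.
        if deficit > best_deficit {
            best_deficit = deficit;
            best_level = level;
        }
    }
    best_level
}
```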
<h2 id="perfectly-balanced-skip-lists">Perfectly balanced skip lists</h2>
<p>Perfect balancing often ends up hurting performance, due to backwards level changes, but it is possible. The basic idea is to reduce the most over-represented level when removing elements.</p>
<h1 id="an-extra-remark">An extra remark</h1>
<p>Skip lists are wonderful as an alternative to Distributed Hash Tables. Performance is mostly about the same, but skip lists are more DoS resistant if you make sure that all links are F2F.</p>
<p>Each node represents a node in the network. Instead of having a head node and a nil node, we connect the ends, so any machine can search starting at itself:</p>
<p><img src="https://i.imgur.com/moD7oy9.png" alt="A network organized as a skip list." /></p>
<p>If you want a secure open system, the trick is that any node can invite another node, giving it a level equal to or lower than its own. If the inviting node controls the key space in the interval from A to B, we partition it into two and transfer all KV pairs in the second part to the new node. Since this approach has no privilege escalation, you can’t easily mount a Sybil attack.</p>
<h1 id="conclusion-and-final-words">Conclusion and final words</h1>
<p>By applying a lot of small, subtle tricks, we can drastically improve the performance of skip lists, providing a simpler and faster alternative to Binary Search Trees. Many of these are really just minor tweaks, but they give an absolutely enormous speed-up.</p>
<p>The diagrams were made with <a href="https://en.wikipedia.org/wiki/Dia_(software)">Dia</a> and <a href="https://en.wikipedia.org/wiki/PGF/TikZ">TikZ</a>.</p>