Notes on Ticki's blog
http://ticki.github.io/tags/notes/index.xml
Designing a good non-cryptographic hash function
http://ticki.github.io/blog/designing-a-good-non-cryptographic-hash-function/
Fri, 04 Nov 2016 16:28:44 +0200
<p>So, I've been needing a hash function for various purposes lately. None of the existing hash functions I could find were sufficient for my needs, so I went and designed my own. These are my notes on the design of hash functions.</p>
<h1 id="what-is-a-hash-function-really">What is a hash function <em>really</em>?</h1>
<p>Hash functions are functions which map an infinite domain to a finite codomain. Two elements in the domain, <span class="math">\(a, b\)</span>, are said to collide if <span class="math">\(h(a) = h(b)\)</span>.</p>
<p>The ideal hash function has the property that the distribution of the image of a subset of the domain is statistically independent of the probability of said subset occurring. That is, collisions are not likely to occur even within non-uniformly distributed sets.</p>
<p>Consider you have an English dictionary. Clearly, <code>hello</code> is more likely to be a word than <code>ctyhbnkmaasrt</code>, but the hash function must not be affected by this statistical redundancy.</p>
<p>In a sense, you can think of the ideal hash function as being a function where the output is uniformly distributed (e.g., chosen by a sequence of coinflips) over the codomain no matter what the distribution of the input is.</p>
<p>With a good hash function, it should be hard to distinguish between a truly random sequence and the hashes of some permutation of the domain.</p>
<p>Hash functions ought to be as chaotic as possible. A small change in the input should appear in the output as if it were a big change. This is called the hash function butterfly effect.</p>
<h2 id="noncryptographic-and-cryptographic">Non-cryptographic and cryptographic</h2>
<p>One must make the distinction between cryptographic and non-cryptographic hash functions. In a cryptographic hash function, it must be infeasible to:</p>
<ol>
<li>Generate the input from its hash output.</li>
<li>Generate two inputs with the same output.</li>
</ol>
<p>Non-cryptographic hash functions can be thought of as approximations of these invariants. The reason for using non-cryptographic hash functions is that they're significantly faster than cryptographic hash functions.</p>
<h1 id="diffusions-and-bijection">Diffusions and bijection</h1>
<p>The basic building blocks of good hash functions are diffusions. Diffusions can be thought of as bijective (i.e. every input has one and only one output, and vice versa) hash functions, in which input and output are uncorrelated:</p>
<p><figure><img src="http://ticki.github.io/img/bijective_diffusion_diagram.svg" alt="A diagram of a diffusion."></figure></p>
<p>This diffusion function has a relatively small domain, for illustrative purposes.</p>
<h2 id="building-a-good-diffusion">Building a good diffusion</h2>
<p>Diffusions are often built from smaller, bijective components, which we will call "subdiffusions".</p>
<h3 id="types-of-subdiffusions">Types of subdiffusions</h3>
<p>One must distinguish between the different kinds of subdiffusions.</p>
<p>The first class to consider is the <strong>bitwise subdiffusions</strong>. These are quite weak when they stand alone, and thus must be combined with other types of subdiffusions. Bitwise subdiffusions might flip certain bits and/or reorganize them:</p>
<p><span class="math">\[d(x) = \sigma(x) \oplus m\]</span></p>
<p>(we use <span class="math">\(\sigma\)</span> to denote permutation of bits)</p>
<p>The second class is <strong>dependent bitwise subdiffusions</strong>. These are diffusions which permute the bits and XOR the result with the original value:</p>
<p><span class="math">\[d(x) = \sigma(x) \oplus x\]</span></p>
<p>(exercise for the reader: prove that the above subdiffusion is invertible)</p>
<p>Another often-used subdiffusion in the same class is the XOR-shift:</p>
<p><span class="math">\[d(x) = (x \ll m) \oplus x\]</span></p>
<p>(note that <span class="math">\(m\)</span> can be negative, in which case the bitshift becomes a right bitshift)</p>
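<p>To see that the XOR-shift is invertible, note that the top <span class="math">\(m\)</span> bits of the output equal the top <span class="math">\(m\)</span> bits of the input, so the input can be recovered <span class="math">\(m\)</span> bits at a time. A sketch in Rust (the 64-bit width and the test values are my own choices):</p>

```rust
/// The XOR-shift subdiffusion d(x) = x ^ (x >> s), here on 64-bit words.
fn xorshift(x: u64, s: u32) -> u64 {
    x ^ (x >> s)
}

/// Invert it: the top s bits of d(x) equal the top s bits of x, so each
/// pass recovers s more correct bits below the ones already known.
fn unxorshift(y: u64, s: u32) -> u64 {
    let mut x = y; // correct in the top s bits
    let mut known = s;
    while known < 64 {
        x = y ^ (x >> s); // now correct in the top (known + s) bits
        known += s;
    }
    x
}

fn main() {
    let a: u64 = 0x0123_4567_89AB_CDEF;
    assert_eq!(unxorshift(xorshift(a, 13), 13), a);
    println!("inverted successfully");
}
```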
<p>The next class of subdiffusions is of massive importance: the <strong>linear subdiffusions</strong>, similar to the LCG random number generator:</p>
<p><span class="math">\[d(x) \equiv ax + c \pmod m, \quad \gcd(a, m) = 1\]</span></p>
<p>(<span class="math">\(\gcd\)</span> means "greatest common divisor"; this constraint is necessary for <span class="math">\(a\)</span> to have an inverse in the ring)</p>
<p>The next class is particularly interesting: the <strong>arithmetic subdiffusions</strong>:</p>
<p><span class="math">\[d(x) = x \oplus (x + c)\]</span></p>
<h3 id="combining-subdiffusions">Combining subdiffusions</h3>
<p>Subdiffusions by themselves are of quite poor quality. Combining them is what creates a good diffusion function.</p>
<p>Indeed, if you combine enough different subdiffusions, you get a good diffusion function, but there is a catch: the more subdiffusions you combine, the slower the diffusion is to compute.</p>
<p>As such, it is important to find a small, diverse set of subdiffusions of good quality.</p>
<h3 id="zerosensitivity">Zero-sensitivity</h3>
<p>If your diffusion isn't zero-sensitive (i.e., if <span class="math">\(f(0) \in \{0, 1\}\)</span>), you should <del>panic</del> come up with something better. In particular, make sure your diffusion contains at least one zero-sensitive subdiffusion as a component.</p>
<h3 id="avalanche-diagrams">Avalanche diagrams</h3>
<p>Avalanche diagrams are the best and quickest way to find out whether your diffusion function is of good quality.</p>
<p>Essentially, you draw a grid such that the color of cell <span class="math">\((x, y)\)</span> represents the probability that flipping the <span class="math">\(x\)</span>'th bit of the input results in the <span class="math">\(y\)</span>'th bit being flipped in the output. If <span class="math">\((x, y)\)</span> is very red, then <span class="math">\(d(a')\)</span>, where <span class="math">\(a'\)</span> is <span class="math">\(a\)</span> with the <span class="math">\(x\)</span>'th bit flipped, is very likely to have the <span class="math">\(y\)</span>'th bit flipped relative to <span class="math">\(d(a)\)</span>.</p>
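<p>Such a diagram can be estimated by sampling. A Rust sketch (the sample count and the LCG used to generate inputs are my own choices):</p>

```rust
/// Estimate the avalanche matrix of `d` by sampling: entry [x][y]
/// approximates the probability that flipping input bit x flips
/// output bit y.
fn avalanche_matrix(d: fn(u64) -> u64, samples: u64) -> Vec<Vec<f64>> {
    let mut counts = vec![vec![0u64; 64]; 64];
    // A simple LCG (Knuth's MMIX constants) generates sample inputs.
    let mut a: u64 = 0x2545_F491_4F6C_DD1D;
    for _ in 0..samples {
        a = a.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        for x in 0..64 {
            // Which output bits flip when we flip input bit x?
            let diff = d(a) ^ d(a ^ (1u64 << x));
            for y in 0..64 {
                counts[x][y] += (diff >> y) & 1;
            }
        }
    }
    counts
        .iter()
        .map(|row| row.iter().map(|&c| c as f64 / samples as f64).collect())
        .collect()
}

fn main() {
    // Sanity check with the identity function: a pure diagonal.
    let m = avalanche_matrix(|x| x, 100);
    assert_eq!(m[3][3], 1.0);
    assert_eq!(m[3][4], 0.0);
}
```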
<p>Here's an example of the identity function, <span class="math">\(f(x) = x\)</span>:</p>
<p><figure><img src="http://ticki.github.io/img/identity_function_avalanche_diagram.svg" alt="The identity function."></figure></p>
<p>So why is it a straight line?</p>
<p>Well, if you flip the <span class="math">\(n\)</span>'th bit in the input, the only bit flipped in the output is the <span class="math">\(n\)</span>'th bit. That's kind of boring, let's try adding a number:</p>
<p><figure><img src="http://ticki.github.io/img/addition_avalanche_diagram.svg" alt="Adding a big number."></figure></p>
<p>Meh, this is kind of obvious. Let's try multiplying by a prime:</p>
<p><figure><img src="http://ticki.github.io/img/prime_multiplication_avalanche_diagram.svg" alt="Multiplying by a non-even prime is a bijection."></figure></p>
<p>Now, this is quite interesting, actually. We call the black areas "blind spots", and you can see here that anything with <span class="math">\(x > y\)</span> is a blind spot. Why is that? Well, if I flip a high bit, it won't affect the lower bits, because you can see multiplication as a form of overlay:</p>
<pre><code>100011101000101010101010111
:
111
↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕↕
100000001000101010101010111
:
111
</code></pre>
<p>Flipping a single bit can only propagate the change upwards (towards more significant bits), never downwards, hence it forms this blind spot. So how can we fix this (we don't want this bias)?</p>
<h4 id="designing-a-diffusion-function--by-example">Designing a diffusion function -- by example</h4>
<p>If we throw in (after the prime multiplication) a dependent bitwise-shift subdiffusion, we have</p>
<p><span class="math">\[\begin{align*}
x &\gets x + 1 \\
x &\gets x \oplus (x \gg z) \\
x &\gets px \\
x &\gets x \oplus (x \ll z) \\
\end{align*}\]</span></p>
<p>(note that we have the <span class="math">\(+1\)</span> in order to make it zero-sensitive)</p>
<p>This generates the following avalanche diagram:</p>
<p><figure><img src="http://ticki.github.io/img/shift_xor_multiply_avalanche_diagram.svg" alt="Shift-XOR then multiply."></figure></p>
<p>What causes these blind spots? Clearly there is some form of bias. It turns out that this bias mostly originates in the lack of a hybrid arithmetic/bitwise subdiffusion.
Without such a hybrid, the behavior tends to be relatively local, and the subdiffusions don't interfere well with each other. So we add one:</p>
<p><span class="math">\[x \gets x + \text{ROL}_k(x)\]</span></p>
<p>At this point, it looks something like</p>
<p><figure><img src="http://ticki.github.io/img/shift_xor_multiply_rotate_avalanche_diagram.svg" alt="Shift-XOR then multiply."></figure></p>
<p>That's good, but we're not quite there yet...</p>
<p>Let's throw in the following bijection:</p>
<p><span class="math">\[x \gets px \oplus (px \gg z)\]</span></p>
<p>And voilà, we now have a perfect bit independence:</p>
<p><figure><img src="http://ticki.github.io/img/perfect_avalanche_diagram.svg" alt="Everything is red!"></figure></p>
<p>So our finalized version of an example diffusion is</p>
<p><span class="math">\[\begin{align*}
x &\gets x + 1 \\
x &\gets x \oplus (x \gg z) \\
x &\gets px \\
x &\gets x \oplus (x \ll z) \\
x &\gets x + \text{ROL}_k(x) \\
x &\gets px \\
x &\gets x \oplus (x \gg z) \\
\end{align*}\]</span></p>
<p>That seems like a pretty lengthy chunk of operations. We will try to boil it down to fewer operations while preserving the quality of this diffusion.</p>
<p>The most obvious thing to remove is the rotation line. But doing so hurts quality:</p>
<p><figure><img src="http://ticki.github.io/img/multiply_up_avalanche_diagram.svg" alt="Here's the avalanche diagram of said line removed."></figure></p>
<p>Where do these blind spots come from? The answer is pretty simple: shifting left moves the entropy upwards, hence the multiplication will never really flip the lower bits. For example, if you flip the sixth bit and trace it down through the operations, you will see how it never flips at the other end.</p>
<p>So what do we do? Instead of shifting left, we shift right, since multiplication only propagates changes upwards:</p>
<p><span class="math">\[\begin{align*}
x &\gets x + 1 \\
x &\gets x \oplus (x \gg z) \\
x &\gets px \\
x &\gets x \oplus (x \gg z) \\
x &\gets px \\
x &\gets x \oplus (x \gg z) \\
\end{align*}\]</span></p>
<p>And we're back again. This time with two fewer instructions.</p>
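<p>As a concrete sketch, the reduced diffusion above translates directly into Rust. The multiplier and shift amount below are stand-in constants of my own choosing, not necessarily good ones:</p>

```rust
/// A sketch of the final example diffusion on 64-bit words. P is an
/// arbitrary odd constant and Z a stand-in shift amount, both my own
/// picks, not the author's.
const P: u64 = 0x9E37_79B9_7F4A_7C15;
const Z: u32 = 32;

fn diffuse(mut x: u64) -> u64 {
    x = x.wrapping_add(1); // zero-sensitivity
    x ^= x >> Z;
    x = x.wrapping_mul(P); // odd multiplier, hence bijective mod 2^64
    x ^= x >> Z;
    x = x.wrapping_mul(P);
    x ^= x >> Z;
    x
}

fn main() {
    assert_ne!(diffuse(0), 0); // the +1 makes it zero-sensitive
    assert_ne!(diffuse(1), diffuse(2)); // bijective, so no collisions
    println!("{:016x}", diffuse(0));
}
```

<p>Since every step is bijective (adding one, XOR-shifting, multiplying by an odd constant), the whole function is a bijection on 64-bit words.</p>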
<table>
<thead>
<tr>
<th><figure><img src="http://ticki.github.io/img/cakehash_stage_1_avalanche_diagram.svg" alt="Stage 1"></figure></th>
<th><figure><img src="http://ticki.github.io/img/cakehash_stage_2_avalanche_diagram.svg" alt="Stage 2"></figure></th>
</tr>
</thead>
<tbody>
<tr>
<td><p><figure><img src="http://ticki.github.io/img/cakehash_stage_3_avalanche_diagram.svg" alt="Stage 3"></figure></p>
</td>
<td><p><figure><img src="http://ticki.github.io/img/cakehash_stage_4_avalanche_diagram.svg" alt="Stage 4"></figure></p>
</td>
</tr>
<tr>
<td><p><figure><img src="http://ticki.github.io/img/cakehash_stage_5_avalanche_diagram.svg" alt="Stage 5"></figure></p>
</td>
<td><p><figure><img src="http://ticki.github.io/img/perfect_avalanche_diagram.svg" alt="Stage 6"></figure></p>
</td>
</tr>
</tbody>
</table>
<h1 id="combining-diffusions">Combining diffusions</h1>
<p>Diffusions map a finite state space to a finite state space; as such, they are not by themselves sufficient as an arbitrary-length hash function, so we need a way to combine diffusions.</p>
<p>In particular, we can eat <span class="math">\(N\)</span> bytes of the input at once and modify the state based on that:</p>
<p><span class="math">\[s' = d(f(s, x))\]</span></p>
<p>Or in graphic form,</p>
<p><figure><img src="http://ticki.github.io/img/hash_round_flowchart.svg" alt="A flowchart."></figure></p>
<p><span class="math">\(f(s, x)\)</span> is what we call our combinator function. It combines the old state <span class="math">\(s\)</span> with the new input block, <span class="math">\(x\)</span>. <span class="math">\(d\)</span> is just our diffusion function.</p>
<p>It doesn't matter whether the combinator function is commutative, but it is crucial that it is not biased, i.e. if <span class="math">\(a, b\)</span> are uniformly distributed variables, <span class="math">\(f(a, b)\)</span> is too. Ideally, there should exist a function <span class="math">\(g\)</span> such that <span class="math">\(g(f(a, b), b) = a\)</span> (i.e. <span class="math">\(f\)</span> is bijective in its first argument), which implies that it is not biased.</p>
<p>An example of such a combinator function is simple addition:</p>
<p><span class="math">\[f(a, b) = a + b\]</span></p>
<p>Another is</p>
<p><span class="math">\[f(a, b) = a \oplus b\]</span></p>
<p>I'm partial towards saying that these are the only sane choices for combinator functions, and you must pick between them based on the characteristics of your diffusion function:</p>
<ol>
<li>If your diffusion function is primarily based on arithmetics, you should use the XOR combinator function.</li>
<li>If your diffusion function is primarily based on bitwise operations, you should use the additive combinator function.</li>
</ol>
<p>The reason for this is that you want the operations to be as diverse as possible, to create complex, seemingly random behavior.</p>
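<p>Putting the pieces together, a full arbitrary-length hash might look like the following sketch. Since the example diffusion is multiplication-heavy, rule 1 says to use the XOR combinator; the tail and length handling here is my own choice (finalization is discussed further below):</p>

```rust
/// Sketch of an arbitrary-length hash built from the example diffusion.
/// The constant and the tail/length finalization are my own choices.
fn diffuse(mut x: u64) -> u64 {
    const P: u64 = 0x9E37_79B9_7F4A_7C15; // arbitrary odd multiplier
    x = x.wrapping_add(1);
    x ^= x >> 32;
    x = x.wrapping_mul(P);
    x ^= x >> 32;
    x = x.wrapping_mul(P);
    x ^= x >> 32;
    x
}

fn hash(data: &[u8]) -> u64 {
    let mut state = 0u64;
    let mut chunks = data.chunks_exact(8);
    for chunk in &mut chunks {
        let mut block = [0u8; 8];
        block.copy_from_slice(chunk);
        // s' = d(f(s, x)) with f(a, b) = a XOR b
        state = diffuse(state ^ u64::from_le_bytes(block));
    }
    // Zero-pad the tail and mix in the total length, so inputs that
    // differ only by trailing zero bytes don't collide.
    let rem = chunks.remainder();
    let mut tail = [0u8; 8];
    tail[..rem.len()].copy_from_slice(rem);
    state = diffuse(state ^ u64::from_le_bytes(tail));
    diffuse(state ^ data.len() as u64)
}

fn main() {
    assert_eq!(hash(b"abcdefghi"), hash(b"abcdefghi"));
    assert_ne!(hash(b"hello"), hash(b"hello\0"));
}
```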
<h1 id="simd-simd-simd">SIMD, SIMD, SIMD</h1>
<p>If you want good performance, you shouldn't read only one byte at a time. By reading multiple bytes at a time, your algorithm becomes several times faster.</p>
<p>This, however, introduces the need for some finalization if the total input length isn't divisible by the number of bytes read in a round. One possibility is to pad with zeros and write the total length at the end; however, this turns out to be somewhat slow for small inputs.</p>
<p>A better option is to write the number of padding bytes into the last byte.</p>
<h1 id="instruction-level-parallelism">Instruction level parallelism</h1>
<p>Fetching multiple blocks and running a round on each sequentially (without dependencies until the last step) is something I've found to work well. This has to do with the so-called instruction pipeline, through which modern processors run instructions in parallel when they can.</p>
<h1 id="testing-the-hash-function">Testing the hash function</h1>
<p>Multiple test suites exist for testing the quality and performance of your hash function. <a href="https://github.com/aappleby/smhasher">SMHasher</a> is one of these.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Many relatively simple components can be combined into a strong and robust non-cryptographic hash function for use in hash tables and in checksumming. Deriving such a function is really just a matter of choosing the right components and combining them.</p>
<p>Breaking the problem down into small subproblems significantly simplifies analysis and guarantees.</p>
<p>The key to a good hash function is trial and error. Testing and throwing out candidates is the only way you can really find out whether your hash function works in practice.</p>
<p>Have fun hacking!</p>
How LZ4 works
http://ticki.github.io/blog/how-lz4-works/
Tue, 25 Oct 2016 23:25:15 +0200
<p>LZ4 is a really fast compression algorithm with a reasonable compression ratio, but unfortunately there is limited documentation on how it works. The only explanation (not spec, explanation) <a href="https://fastcompression.blogspot.com/2011/05/lz4-explained.html">can be found</a> on the author's blog, but I think it is less of an explanation and more of an informal specification.</p>
<p>This blog post tries to explain it such that anybody (even beginners) can understand and implement it.</p>
<h1 id="linear-smallinteger-code-lsic">Linear small-integer code (LSIC)</h1>
<p>The first part of LZ4 we need to explain is a smart but simple integer encoder. It is very space-efficient for 0-255, and then grows linearly, based on the assumption that the integers used with this encoding rarely exceed this limit; as such, it is only used for small integers in the standard.</p>
<p>It is a form of addition code, in which we read a byte. If this byte is the maximal value (255), another byte is read and added to the sum. This process is repeated until a byte below 255 is reached; that byte is added to the sum, and the sequence then ends.</p>
<p><figure><img src="http://ticki.github.io/img/lz4_int_encoding_flowchart.svg" alt="We try to fit it into the next cluster."></figure></p>
<p>In short, we just keep adding bytes and stop when we hit a non-0xFF byte.</p>
<p>We'll use the name "LSIC" for convenience.</p>
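<p>A minimal Rust sketch of LSIC decoding (and encoding, to show the round trip):</p>

```rust
/// Decode an LSIC integer: sum bytes until one below 0xFF is seen.
/// Returns (value, number of bytes consumed), or None if the input
/// ends in the middle of a sequence.
fn lsic_decode(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut sum = 0u64;
    for (i, &b) in bytes.iter().enumerate() {
        sum += b as u64;
        if b != 0xFF {
            return Some((sum, i + 1));
        }
    }
    None
}

/// Encode: emit 0xFF bytes until the remainder fits below one.
fn lsic_encode(mut n: u64) -> Vec<u8> {
    let mut out = Vec::new();
    while n >= 0xFF {
        out.push(0xFF);
        n -= 0xFF;
    }
    out.push(n as u8);
    out
}

fn main() {
    assert_eq!(lsic_decode(&[42]), Some((42, 1)));
    assert_eq!(lsic_decode(&[255, 255, 0]), Some((510, 3)));
    assert_eq!(lsic_decode(&lsic_encode(1000)), Some((1000, 4)));
    assert_eq!(lsic_decode(&[255]), None); // truncated sequence
}
```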
<h1 id="block">Block</h1>
<p>An LZ4 stream is divided into segments called "blocks". A block contains a literal, which is to be copied directly to the output stream, followed by a back reference, which tells us to copy some number of bytes from the already decompressed stream.</p>
<p>This is really where the compression happens. Copying from the old stream allows deduplication and run-length encoding.</p>
<h2 id="overview">Overview</h2>
<p>A block looks like:</p>
<p><span class="math">\[\overbrace{\underbrace{t_1}_\text{4 bits}\ \underbrace{t_2}_\text{4 bits}}^\text{Token} \quad \underbrace{\overbrace{e_1}^\texttt{LSIC}}_\text{If $t_1 = 15$} \quad \underbrace{\overbrace{L}^\text{Literal}}_{t_1 + e_1\text{ bytes }} \quad \overbrace{\underbrace{O}_\text{2 bytes}}^\text{Little endian} \quad \underbrace{\overbrace{e_2}^\texttt{LSIC}}_\text{If $t_2 = 15$}\]</span></p>
<p>And decodes to the <span class="math">\(L\)</span> segment, followed by a <span class="math">\(t_2 + e_2 + 4\)</span> bytes sequence copied from position <span class="math">\(l - O\)</span> from the output buffer (where <span class="math">\(l\)</span> is the length of the output buffer).</p>
<p>We will explain all of these in the next sections.</p>
<h2 id="token">Token</h2>
<p>Any block starts with a 1 byte token, which is divided into two 4-bit fields.</p>
<h2 id="literals">Literals</h2>
<p>The first (highest) field in the token is used to define the literals length. This obviously takes a value of 0-15.</p>
<p>Since we might want to encode a higher integer, we make use of LSIC encoding: if the field is 15 (the maximal value), we read an integer with LSIC and add it to the original value (15) to obtain the literals length.</p>
<p>Call the final value <span class="math">\(L\)</span>.</p>
<p>Then we forward the next <span class="math">\(L\)</span> bytes from the input stream to the output stream.</p>
<p><figure><img src="http://ticki.github.io/img/lz4_literals_copy_diagram.svg" alt="We copy from the buffer directly."></figure></p>
<h2 id="deduplication">Deduplication</h2>
<p>The next few bytes are used to define some segment in the already decoded buffer, which is going to be appended to the output buffer.</p>
<p>This allows us to transmit a position and a length to read from in the already decoded buffer, instead of transmitting the literals themselves.</p>
<p>To start with, we read a 16-bit little endian integer. This defines the so-called offset, <span class="math">\(O\)</span>. It is important to understand that the offset is not the starting position of the copied buffer. The starting point is calculated by <span class="math">\(l - O\)</span>, with <span class="math">\(l\)</span> being the number of bytes already decoded.</p>
<p>Secondly, similarly to the literals length, if <span class="math">\(t_2\)</span> is 15 (the maximal value), we use LSIC to "extend" this value and add the result. This, plus 4, yields the number of bytes we will copy from the output buffer. The reason we add 4 is that copying fewer than 4 bytes would result in a negative expansion of the compressed buffer.</p>
<p>Now that we know the start position and the length, we can append the segment to the buffer itself:</p>
<p><figure><img src="http://ticki.github.io/img/lz4_deduplicating_diagram.svg" alt="Copying in action."></figure></p>
<p>It is important to understand that the end of the segment might not be initialized before the rest of the segment is appended, because overlaps are allowed. This allows a neat trick, namely "run-length encoding", where you repeat some sequence a given number of times:</p>
<p><figure><img src="http://ticki.github.io/img/lz4_runs_encoding_diagram.svg" alt="We repeat the last byte."></figure></p>
<p>Note that the duplicate section is not required if you're at the end of the stream, i.e. if there are no more compressed bytes to read.</p>
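<p>Putting the token, literals, offset, and match copy together, a minimal block decoder might look like the following Rust sketch (bounds checking and error handling are omitted; this follows the description above, not the full specification):</p>

```rust
/// Decompress a raw LZ4 block (no frame format). Panics on malformed
/// input; a real implementation would return errors instead.
fn lz4_decompress(mut input: &[u8]) -> Vec<u8> {
    // LSIC extension: sum bytes until one below 0xFF.
    fn lsic(input: &mut &[u8]) -> usize {
        let mut sum = 0;
        loop {
            let b = input[0];
            *input = &input[1..];
            sum += b as usize;
            if b != 0xFF {
                return sum;
            }
        }
    }
    let mut out = Vec::new();
    while !input.is_empty() {
        let token = input[0];
        input = &input[1..];
        // Literals length: high 4 bits, LSIC-extended if 15.
        let mut lit_len = (token >> 4) as usize;
        if lit_len == 15 {
            lit_len += lsic(&mut input);
        }
        out.extend_from_slice(&input[..lit_len]);
        input = &input[lit_len..];
        if input.is_empty() {
            break; // last block: literals only, no duplicate section
        }
        // Offset: 16-bit little endian.
        let offset = u16::from_le_bytes([input[0], input[1]]) as usize;
        input = &input[2..];
        // Match length: low 4 bits, LSIC-extended if 15, plus 4.
        let mut match_len = (token & 0xF) as usize;
        if match_len == 15 {
            match_len += lsic(&mut input);
        }
        match_len += 4;
        // Copy byte by byte so overlapping matches (run-length) work.
        let start = out.len() - offset;
        for i in 0..match_len {
            let byte = out[start + i];
            out.push(byte);
        }
    }
    out
}

fn main() {
    // Token 0x50: 5 literal bytes, then a 4-byte match at offset 5.
    let out = lz4_decompress(&[0x50, b'h', b'e', b'l', b'l', b'o', 0x05, 0x00]);
    assert_eq!(out, b"hellohell");
    // Run-length encoding: 1 literal, then an overlapping 8-byte match at offset 1.
    assert_eq!(lz4_decompress(&[0x14, b'a', 0x01, 0x00]), b"a".repeat(9));
}
```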
<h1 id="compression">Compression</h1>
<p>Until now, we have only considered decoding, not the reverse process.</p>
<p>A dozen approaches to compression exist. What they share is the need to find duplicates in the already-read input buffer.</p>
<p>In general, there are two classes of such compression algorithms:</p>
<ol>
<li>HC: high-compression-ratio algorithms. These are often very complex, and might include steps like backtracking, removing repetition, and non-greedy matching.</li>
<li>FC: fast compression. These are simpler and faster, but provide a slightly worse compression ratio.</li>
</ol>
<p>We will focus on the FC-class algorithms.</p>
<p>Binary search trees (often B-trees) are often used for searching for duplicates. In particular, every byte iterated over adds a pointer to the rest of the buffer to a B-tree, which we call the "duplicate tree". Now, B-trees allow us to retrieve the largest element smaller than or equal to some key. In lexicographic ordering, this is equivalent to asking for the element sharing the longest prefix with the key.</p>
<p>For example, consider the table:</p>
<pre><code>abcdddd => 0
bcdddd => 1
cdddd => 2
dddd => 3
</code></pre>
<p>If we search for <code>cddda</code>, we'll get a partial match, namely <code>cdddd => 2</code>. We can then quickly find out how many bytes they share as prefix. In this case, it is 4 bytes.</p>
<p>What if we found no match, or a bad match (one that shares less than some threshold)? Well, then we write literals until a good match is found.</p>
<p>As you may notice, the dictionary grows linearly. As such, it is important to reduce its memory usage once in a while by trimming it. Note that just trimming the first (or last) <span class="math">\(N\)</span> entries is inefficient, because some entries might be used often. Instead, a <a href="https://en.wikipedia.org/wiki/Cache_Replacement_Policies">cache replacement policy</a> should be used: if the dictionary is filled, the cache replacement policy determines which match should be replaced. I've found PLRU a good choice of CRP for LZ4 compression.</p>
<p>Note that you should add additional rules, like the match being addressable (within <span class="math">\(2^{16} + 4\)</span> bytes of the cursor, which is required because <span class="math">\(O\)</span> is 16-bit) and being above some length (smaller keys have a worse block-level compression ratio).</p>
<p>Another, faster but worse (compression-wise) approach is hashing every four bytes and placing them in a table. This means that you can only look up the latest sequence given some 4-byte prefix. A lookup lets you progress and see how far the duplicate sequences match. When you can't go any further, you encode a literals section until another duplicate 4-byte prefix is found.</p>
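<p>A hedged Rust sketch of this table-based match finder (the greedy extension and the return convention are my own choices):</p>

```rust
use std::collections::HashMap;

/// Sketch of the hash-table match finder: remember the last position of
/// every 4-byte sequence seen, and extend greedily on a hit. Returns
/// (offset, match length) for the data at `pos`, if a previous occurrence
/// of its 4-byte prefix exists.
fn find_match(
    data: &[u8],
    pos: usize,
    table: &mut HashMap<[u8; 4], usize>,
) -> Option<(usize, usize)> {
    if pos + 4 > data.len() {
        return None;
    }
    let key = [data[pos], data[pos + 1], data[pos + 2], data[pos + 3]];
    // Record the current position and retrieve the previous one, if any.
    let prev = table.insert(key, pos)?;
    // Extend the match; overlapping matches are fine, since the decoder
    // copies byte by byte from already-written output.
    let mut len = 0;
    while pos + len < data.len() && data[prev + len] == data[pos + len] {
        len += 1;
    }
    Some((pos - prev, len))
}

fn main() {
    let mut table = HashMap::new();
    let data = b"abcdXabcd";
    assert_eq!(find_match(data, 0, &mut table), None); // first occurrence
    assert_eq!(find_match(data, 5, &mut table), Some((5, 4)));
}
```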
<h1 id="conclusion">Conclusion</h1>
<p>LZ4 is a reasonably simple algorithm with reasonably good compression ratio. It is the type of algorithm that you can implement on an afternoon without much complication.</p>
<p>If you need a portable and efficient compression algorithm which can be implemented in only a few hundred lines, LZ4 would be my go-to.</p>
On Random-Access Compression
http://ticki.github.io/blog/on-random-access-compression/
Sun, 23 Oct 2016 23:25:15 +0200
<p>This post contains an algorithm I came up with for doing efficient rolling compression. It's going to be used in <a href="https://github.com/ticki/tfs">TFS</a>.</p>
<h1 id="what-is-rolling-compression">What is rolling compression?</h1>
<p>Consider that you have a large file and you want to compress it. That's easy enough, and many algorithms exist for doing so. Now, consider that you want to read or write a small part of the file.</p>
<p>Most algorithms would require you to decompress, write, and recompress the whole file. Clearly, this gets expensive when the file is big.</p>
<h1 id="clusterbased-compression">Cluster-based compression</h1>
<p>A cluster is some small fixed-size block (often 512, 1024, or 4096 bytes). We can have a basic cluster allocator by linking unused clusters together. Cluster-centric compression is interesting, because it can exploit the allocator.</p>
<p>So, the outline is that we compress every <span class="math">\(n\)</span> adjacent clusters to some <span class="math">\(n' < n\)</span> clusters, and then free the excess clusters in this compressed line.</p>
<h1 id="copyonwrite">Copy-on-write</h1>
<p>Our structure cannot be written in place, but it can be written by allocating, copying, and deallocating. This is called copy-on-write, or COW for short. It is a common technique used in many file systems.</p>
<p>Essentially, we never write a cluster. Instead, we allocate a new cluster, and copy the data to it. Then we deallocate the old cluster.</p>
<p>This allows us to approach everything much more functionally, and we thus don't have to worry about making compressible blocks uncompressible (consider overwriting a highly compressible cluster with random data: it would extend a physical cluster containing many virtual clusters, which then couldn't all fit in one cluster).</p>
<h1 id="physical-and-virtual-clusters">Physical and virtual clusters</h1>
<p>Our goal is really to fit multiple clusters into one physical cluster. Therefore, it is essential to distinguish between physical (the stored) and virtual (the compressed) clusters.</p>
<p>A physical cluster can contain up to 8 virtual clusters. A pointer to a virtual cluster starts with 3 bits defining the index into the physical cluster, which is defined by the rest of the pointer.</p>
<p>The allocated physical cluster contains 8 bitflags, defining which of the 8 virtual clusters in the physical cluster are used. This allows us to know how many virtual clusters we need to go over before we get the target decompressed cluster.</p>
<p>When this flag byte hits zero (i.e. all the virtual clusters are freed), the physical cluster is freed.</p>
<p>Since an active cluster will never have the state zero, we use this blind state to represent an uncompressed physical cluster. This means we have at most one byte of space overhead for uncompressible clusters.</p>
<p><figure><img src="http://ticki.github.io/img/virtual_physical_random_access_compression_diagram.svg" alt="A diagram"></figure></p>
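<p>A small Rust sketch of the pointer layout and the flag byte (assuming 64-bit pointers with the index in the top 3 bits, as described above):</p>

```rust
/// Split a virtual cluster pointer into (index into the physical
/// cluster, physical cluster address). Assumes 64-bit pointers with
/// the 3-bit index stored in the top bits.
fn split_vptr(p: u64) -> (u8, u64) {
    ((p >> 61) as u8, p & ((1u64 << 61) - 1))
}

/// Clear virtual slot i's flag. If the result is zero (no virtual
/// clusters left), the caller frees the whole physical cluster; zero
/// doubles as the "uncompressed" state, so live compressed clusters
/// always have at least one flag set.
fn free_slot(flags: u8, i: u8) -> u8 {
    flags & !(1u8 << i)
}

fn main() {
    let (idx, addr) = split_vptr((0b101u64 << 61) | 42);
    assert_eq!((idx, addr), (0b101, 42));
    assert_eq!(free_slot(0b0000_0101, 2), 0b0000_0001);
    assert_eq!(free_slot(0b0000_0001, 0), 0); // last slot: free the cluster
}
```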
<h1 id="the-physical-cluster-allocator">The physical cluster allocator</h1>
<p>The cluster allocator is nothing but a linked list of clusters. Every free cluster links to another free cluster or NIL (no more free clusters).</p>
<p>This method is called SLOB (Simple List Of Objects) and has the advantage of being completely zero-cost, in that there is no wasted space.</p>
<p><figure><img src="http://ticki.github.io/img/slob_allocation_diagram.svg" alt="Physical allocation is simply linked list of free objects."></figure></p>
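<p>A toy in-memory sketch of the SLOB free list in Rust (a real implementation would thread the links through the free clusters' own bytes on disk):</p>

```rust
/// Toy SLOB: free clusters form a linked list, here stored as an
/// Option<usize> per cluster, so the allocator needs no extra space.
struct Slob {
    next: Vec<Option<usize>>, // next[i] = next free cluster after i
    head: Option<usize>,      // first free cluster, or None (NIL)
}

impl Slob {
    /// Start with all n clusters free, linked 0 → 1 → … → n-1 → NIL.
    fn new(n: usize) -> Self {
        let next = (1..n).map(Some).chain(std::iter::once(None)).collect();
        Slob { next, head: Some(0) }
    }

    /// Pop the head of the free list.
    fn alloc(&mut self) -> Option<usize> {
        let c = self.head?;
        self.head = self.next[c].take();
        Some(c)
    }

    /// Push a cluster back onto the free list.
    fn free(&mut self, c: usize) {
        self.next[c] = self.head;
        self.head = Some(c);
    }
}

fn main() {
    let mut s = Slob::new(3);
    assert_eq!(s.alloc(), Some(0));
    assert_eq!(s.alloc(), Some(1));
    s.free(0);
    assert_eq!(s.alloc(), Some(0)); // freed cluster is reused first
    assert_eq!(s.alloc(), Some(2));
    assert_eq!(s.alloc(), None); // exhausted
}
```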
<h1 id="the-virtual-cluster-allocator">The virtual cluster allocator</h1>
<p>Now we hit the meat of the matter.</p>
<p>When a virtual cluster is allocated, we read from the physical cluster list. The first thing we check is whether we can fit our virtual cluster into the cluster next to the head of the list (wrapping if we reach the end).</p>
<p>If we can fit it in <em>and</em> we have fewer than 8 virtual clusters in this physical cluster, we will put it into the compressed physical cluster at the first free virtual slot (and then set the respective bitflag):</p>
<p><figure><img src="http://ticki.github.io/img/allocating_compressed_virtual_page_into_next_diagram.svg" alt="We try to fit it into the next cluster."></figure></p>
<p>If we cannot, we pop the list and use the fully free physical cluster to establish a new stack of virtual clusters. It starts out uncompressed:</p>
<p><figure><img src="http://ticki.github.io/img/pop_and_create_new_uncompressed_cluster_diagram.svg" alt="We pop the list and put the virtual cluster in the physical uncompressed slot."></figure></p>
<h1 id="properties-of-this-approach">Properties of this approach</h1>
<p>This approach to writable random-access compression has some very nice properties.</p>
<h2 id="compression-miss">Compression miss</h2>
<p>We call it a compression miss when we need to pop from the freelist (i.e. we cannot fit the cluster in next to the head). An allocation can have at most one compression miss, and therefore allocation is constant-time.</p>
<h2 id="every-cluster-has-a-sister-cluster">Every cluster has a sister cluster</h2>
<p>Because the "next cluster or wrap" function is bijective, we're sure that every cluster will eventually be tried for insertion. This wouldn't be true if we used a hash function or something else.</p>
<p>This has the interesting consequence that we won't repeatedly try to allocate into already-filled clusters.</p>
<h1 id="limitations">Limitations</h1>
<p>This algorithm has a number of limitations. The first and most obvious one is the limit on the compression ratio. This is a minor one: it limits the ratio to maximally slightly less than 1:8.</p>
<p>A more important limitation is fragmentation. If I allocate many clusters and then deallocate some of them, such that many adjacent physical clusters each contain only one virtual cluster, this row will have a compression ratio of 1:1 until those clusters are deallocated. Note that this happens very rarely, and it will only marginally affect the global compression ratio.</p>
<h1 id="update-an-idea">Update: An idea</h1>
<p>A simple trick can improve performance in some cases. Instead of compressing all the virtual clusters in a physical cluster together, you compress each virtual cluster separately and place them sequentially (with some delimiter) in the physical cluster.</p>
<p>If your compression algorithm is streaming, you can iterate to the right delimiter much faster, and then decompress only that virtual cluster.</p>
<p>This has the downside of making the compression ratio worse. One solution is to have an initial dictionary (if using a dictionary-based compression algorithm).</p>
<p>Another idea is to eliminate the cluster state and replace it by repeated delimiters. I need to investigate this some more with benchmarks and so on in order to tell if this is actually superior to having a centralized cluster state.</p>