This post will contains an algorithm I came up with, doing efficient rolling compression. It's going to be used in TFS.
What is rolling compression?
Consider that you have a large file and you want to compress it. That's easy enough and many algorithms exists for doing so. Now, consider that you want to read or write a small part of the file.
Most algorithms would require you to decompress, write, and recompress the whole file. Clearly, this gets expensive when the file is big.
A cluster is some small fixed-size block (often 512, 1024, or 4096 bytes). We can have a basic cluster allocator by linking unused clusters together. Cluster-centric compression is interesting, because it can exploit the allocator.
So, the outline is that we compress every \(n\) adjacent clusters to some \(n' < n%>\), then we can free the excessive clusters in this compressed line.
Our algorithm is not writable, but it can be written by allocating, copying, and deallocating. This is called copy-on-write, or COW for short. It is a common technique used in many file systems.
Essentially, we never write a cluster. Instead, we allocate a new cluster, and copy the data to it. Then we deallocate the old cluster.
This allows us to approach everything much more functionally, and we thus don't have to worry about make compressible blocks uncompressible (consider that you overwrite a highly compressible cluster with random data, then you extend a physical cluster containing many virtual clusters, these wouldn't be possible to have in one cluster).
Physical and virtual clusters
Our goal is really fit multiple clusters into one physical cluster. Therefore, it is essential to distinguish between physical (the stored) and virtual (the compressed) clusters.
A physical cluster can contain up to 8 virtual clusters. A pointer to a virtual cluster starts with 3 bits defining the index into the physical cluster, which is defined by the rest of the pointer.
The allocated physical cluster contains 8 bitflags, defining which of the 8 virtual clusters in the physical cluster are used. This allows us to know how many virtual clusters we need to go over before we get the target decompressed cluster.
When the integer hits zero (i.e. all the virtual clusters are freed), the physical cluster is freed.
Since an active cluster will never have the state zero, we use this blind state to represent an uncompressed physical cluster. This means we maximally have one byte in space overhead for uncompressible clusters.
The physical cluster allocator
The cluster allocator is nothing but a linked list of clusters. Every free cluster links to another free cluster or NIL (no more free clusters).
This method is called SLOB (Simple List Of Objects) and has the advantage of being complete zero-cost in that there is no wasted space.
The virtual cluster allocator
Now we hit the meat of the matter.
When virtual cluster is allocated, we read from the physical cluster list. The first thing we will check is if we can fit in our virtual cluster into the cluster next to the head of the list (we wrap if we reach the end).
If we can fit it in and we have less than 8 virtual clusters in this physical cluster, we will put it into the compressed physical cluster at the first free virtual slot (and then set the respective bitflag):
If we cannot, we pop the list and use the fully-free physical cluster to store etablish a new stack of virtual clusters. It starts as uncompressed:
Properties of this approach
This approach to writable random-access compression has some very nice properties.
We call it a compression miss when we need to pop from the freelist (i.e. we cannot fit it into the cluster next to the head). When you allocate you can maximally have one compression miss, and therefore allocation is constant-time.
Every cluster has a sister cluster
Because the "next cluster or wrap" function is bijective, we're sure that we try to insert a virtual cluster to every cluster at least once. This wouldn't be true if we used a hash function or something else.
This has the interesting consequence that filled clusters won't be tried to allocate in multiple times.
A number of limitations are in this algorithms. The first and most obvious one is the limitation on the compression ratio. This is a minor one: it limits the ratio to maxmially slightly less than 1:8.
A more important limitation is fragmentation. If I allocate many clusters and then deallocate some of them such that many adjacent physical clusters only contain one virtual cluster, this row will have a compression ratio of 1:1 until they're deallocated. Note that it is very rare that this happens, and will only marginally affect the global compression ratio.
Update: An idea
A simple trick can improve performance in some cases. Instead of compressing all the virtual clusters in a physical cluster together, you should compress each virtual cluster seperately and place them sequentially (with some delimiter) in the physical cluster.
If your compression algorithm is streaming, you can much faster iterate to the right delimiter, and then only decompress that virtual cluster.
This has the downside of making the compression ratio worse. One solution is to have an initial dictionary (if using a dictionary-based compression algorithm).
Another idea is to eliminate the cluster state and replace it by repeated delimiters. I need to investigate this some more with benchmarks and so on in order to tell if this is actually superior to having a centralized cluster state.