In my recent post on optimising zlib decompression for the Apple M1, I used a loop that refilled a bit-buffer and decoded a huffman code each iteration, based on variant 4 from Fabian Giesen’s *Reading bits in far too many ways (part 2)*. This had a 6-cycle decode, plus only 2-cycles of refill latency on the critical path – much better than the naive approach. Since then, I figured out a (possibly novel) strategy for a ~20% faster loop, by taking the refill off the critical path entirely, translating to ~1.06x DEFLATE decompression speed.

I’d spent some time staring at the old loop, trying to save a cycle before writing that post:

```
// 2-cycle refill:
bitbuf = 0;
bitcount = 0;
do {
// refill:
bitbuf |= read64LE(bitptr) << bitcount; // 2c
bitptr += 7;
bitptr -= ((bitcount >> 3) & 7);
bitcount |= 56;
// decode one:
value = table[bitbuf & 0x1FF]; // 5c
bitcount -= value; // 1c
bitbuf >>= (value & 63);
*output++ = value >> 8;
} while (--n);
```

Surely 7-cycle latency was possible? After that time, the idea was surprisingly simple: don’t let the bit-buffer get empty. If we refill while the bit-buffer still has enough bits for the following table-lookup, that table-lookup needn’t wait for the refill. It can instead use the data from the old bit-buffer. By the time the table-lookup is complete, the refill will have completed, and the refilled bit-buffer can be used in later operations, such as the shift.

```
// Zero refill latency:
// initial fill (bitbuf can't be empty)
bitbuf = read64LE(bitptr);
bitptr += 7;
bitcount = 56;
do {
// don't immediately set bitbuf
pending = bitbuf | (read64LE(bitptr) << bitcount);
bitptr += 7;
bitptr -= ((bitcount >> 3) & 7);
bitcount |= 56;
// load using old bitbuf
value = table[bitbuf & 0x1FF]; // 5c
// commit the new bitbuf
bitbuf = pending;
bitcount -= value;
bitbuf >>= (value & 63); // 1c
*output++ = value >> 8;
// can have more unrolled 6c decodes as usual:
// --n;
// value = table[bitbuf & 0x1FF];
// bitcount -= value;
// bitbuf >>= (value & 63);
// *output++ = value >> 8;
} while (--n);
```

In theory, this should have 6 cycle latency. Rough real-world measurements give ~6.77 cycles, down from ~8.33 cycles. Not quite the 25% theoretical reduction. Low enough that I’m convinced that there isn’t a 7-cycle latency chain I’ve overlooked, but high enough that it’s still clearly beneficial to unroll to avoid refilling.

In my unstable, experimental zlib fork, this change translates to ~1.06x speed, requiring some extra logic to make sure we don’t run out of bits. (If we have less than 10 bits after decoding a length/distance pair, I actually do a full refill before the optimised refill, which doesn’t make much sense, but it’s rare enough that that’s what happened to work.)

[*Update*: I put an optimised x86-64 version with UICA links on Godbolt. A lot of the complexity is to work around move elimination being disabled in Ice Lake/Tiger Lake, by using three-operand BMI instructions.]

## Bonus detail

In a branch-free decode loop using this strategy, we can read at most 56 bits before refilling. This is because we refill such that 56 <= bitcount <= 63, so reading 57-bits would cause this to wrap around. However, we actually fill bitbuf with 64 bits of data each refill, and we can rely on those extra bits when we’re using last-iteration’s bit-buffer, because we aren’t using last-iteration’s bitcount.

So for unconditional-refill you get 8 bits of slack (as we must assume bitcount could be as low as 56), but for code that branches on bitcount you only get 1 bit of slack (as bitcount after refill may be as high as 63, reflecting all but one bit in bitbuf).

Specifically this could work nicely for a simpler DEFLATE decoder that unconditionally refills after each literal or length/distance pair (e.g. Wuffs-the-library’s). This could read a worst-case 48-bit length/distance pair, while still having 16-bits left in bitbuf, for use by the first table-lookup of the next iteration.

## Final notes

I believe this is a new discovery, but please let me know if there’s prior art I should credit.

With this change (and a smaller, probably Apple-CPU-specific chunkcopy change), my experimental zlib fork has improved ~8% since the last post, requiring me to rescale my graph:

This looks a bit less impressive, as libdeflate is not only included in the graph, but has also improved significantly since my last post. My code reaches ~5.28x speed compared to zlib-madler, ~2.30x compared to zlib-apple, ~1.64x compared to zlib-cloudflare, and ~1.10x compared to libdeflate (for now).

There’s a lot still to be explored, like putting pairs of literals with short codes in the decode table (huge win for some data, as we can read two literals for the price of one), and Fabian’s suggested improvements to table building (much nicer, but I’m still getting my head around the transpose [update: the more detailed explanation and sample code clarified this for me], and need to decide how it’ll work with multi-level tables).

The zlib-dougallj code is on Github, under the zlib license, but has not been thoroughly tested. I’m @dougallj on Twitter.