Reading bits with zero refill latency

In my recent post on optimising zlib decompression for the Apple M1, I used a loop that refilled a bit-buffer and decoded a huffman code each iteration, based on variant 4 from Fabian Giesen’s Reading bits in far too many ways (part 2). This had a 6-cycle decode, plus only 2-cycles of refill latency on the critical path – much better than the naive approach. Since then, I figured out a (possibly novel) strategy for a ~20% faster loop, by taking the refill off the critical path entirely, translating to ~1.06x DEFLATE decompression speed.

I’d spent some time staring at the old loop, trying to save a cycle before writing that post:

  // 2-cycle refill:
  bitbuf = 0;
  bitcount = 0;
  do {
    // refill:
    bitbuf |= read64LE(bitptr) << bitcount;  // 2c
    bitptr += 7;
    bitptr -= ((bitcount >> 3) & 7);
    bitcount |= 56;

    // decode one:
    value = table[bitbuf & 0x1FF];           // 5c
    bitcount -= value;                       // 1c
    bitbuf >>= (value & 63);
    *output++ = value >> 8;
  } while (--n);

Surely 7-cycle latency was possible? After that time, the idea was surprisingly simple: don’t let the bit-buffer get empty. If we refill while the bit-buffer still has enough bits for the following table-lookup, that table-lookup needn’t wait for the refill. It can instead use the data from the old bit-buffer. By the time the table-lookup is complete, the refill will have completed, and the refilled bit-buffer can be used in later operations, such as the shift.

  // Zero refill latency:

  // initial fill (bitbuf can't be empty)
  bitbuf = read64LE(bitptr);
  bitptr += 7;
  bitcount = 56;

  do {
    // don't immediately set bitbuf
    pending = bitbuf | (read64LE(bitptr) << bitcount);

    bitptr += 7;
    bitptr -= ((bitcount >> 3) & 7);
    bitcount |= 56;

    // load using old bitbuf
    value = table[bitbuf & 0x1FF];  // 5c

    // commit the new bitbuf
    bitbuf = pending;

    bitcount -= value;
    bitbuf >>= (value & 63);        // 1c
    *output++ = value >> 8;

    // can have more unrolled 6c decodes as usual:
    // --n;
    // value = table[bitbuf & 0x1FF];
    // bitcount -= value;
    // bitbuf >>= (value & 63);
    // *output++ = value >> 8;
  } while (--n);

In theory, this should have 6 cycle latency. Rough real-world measurements give ~6.77 cycles, down from ~8.33 cycles. Not quite the 25% theoretical reduction. Low enough that I’m convinced that there isn’t a 7-cycle latency chain I’ve overlooked, but high enough that it’s still clearly beneficial to unroll to avoid refilling.

In my unstable, experimental zlib fork, this change translates to ~1.06x speed, requiring some extra logic to make sure we don’t run out of bits. (If we have less than 10 bits after decoding a length/distance pair, I actually do a full refill before the optimised refill, which doesn’t make much sense, but it’s rare enough that that’s what happened to work.)

[Update: I put an optimised x86-64 version with UICA links on Godbolt. A lot of the complexity is to work around move elimination being disabled in Ice Lake/Tiger Lake, by using three-operand BMI instructions.]

Bonus detail

In a branch-free decode loop using this strategy, we can read at most 56 bits before refilling. This is because we refill such that 56 <= bitcount <= 63, so reading 57-bits would cause this to wrap around. However, we actually fill bitbuf with 64 bits of data each refill, and we can rely on those extra bits when we’re using last-iteration’s bit-buffer, because we aren’t using last-iteration’s bitcount.

So for unconditional-refill you get 8 bits of slack (as we must assume bitcount could be as low as 56), but for code that branches on bitcount you only get 1 bit of slack (as bitcount after refill may be as high as 63, reflecting all but one bit in bitbuf).

Specifically this could work nicely for a simpler DEFLATE decoder that unconditionally refills after each literal or length/distance pair (e.g. Wuffs-the-library’s). This could read a worst-case 48-bit length/distance pair, while still having 16-bits left in bitbuf, for use by the first table-lookup of the next iteration.

Final notes

I believe this is a new discovery, but please let me know if there’s prior art I should credit.

With this change (and a smaller, probably Apple-CPU-specific chunkcopy change), my experimental zlib fork has improved ~8% since the last post, requiring me to rescale my graph:

This looks a bit less impressive, as libdeflate is not only included in the graph, but has also improved significantly since my last post. My code reaches ~5.28x speed compared to zlib-madler, ~2.30x compared to zlib-apple, ~1.64x compared to zlib-cloudflare, and ~1.10x compared to libdeflate (for now).

There’s a lot still to be explored, like putting pairs of literals with short codes in the decode table (huge win for some data, as we can read two literals for the price of one), and Fabian’s suggested improvements to table building (much nicer, but I’m still getting my head around the transpose [update: the more detailed explanation and sample code clarified this for me], and need to decide how it’ll work with multi-level tables).

The zlib-dougallj code is on Github, under the zlib license, but has not been thoroughly tested. I’m @dougallj on Twitter.

Faster zlib/DEFLATE decompression on the Apple M1 (and x86)

DEFLATE is a relatively slow compression algorithm from 1991, which (along with its wrapper format, zlib) is incredibly widely used, for example in the PNG, Zip and Gzip file formats and the HTTP, SSL, and SSH protocols. As such, I had assumed the most widely used implementation, zlib-madler, was extremely optimised. I’d also imagined Apple’s system zlib might outperform open-source alternatives on their own hardware. I was wrong. Fortunately there are optimised forks of zlib such as zlib-ng, Chromium’s zlib and zlib-cloudflare, which significantly outperform zlib-madler and Apple’s system zlib.

I tried optimising zlib’s decompression for the M1, using zlib-cloudflare as a starting point, and was able to show 1.51x speed compared to zlib-cloudflare, and 2.1x speed compared to Apple’s system zlib:

Graph of decompression performance, showing zlib-dougallj outperforming 3 alternatives by about 50% on each of the files in the Silesia Corpus. zlib-cloudflare is consistently second best, closely followed by zlib-ng, with zlib-apple further behind, outperformed by about 2x. The files have been compressed with zlib level 6, and throughput is measured in compressed mb/s on M1 using Apple clang 13.1.6. zlib-dougallj performs between 282 and 380mb/s, cloudflare performs between 186 and 276mb/s.
xml: apple 122, ng 181, cf 194, dj 290.
x-ray: apple 181, ng 190, cf 209, dj 330.
webster: apple 132, ng 195, cf 205, dj 303.
sao: apple 234, ng 260, cf 276, dj 380.
samba: apple 151, ng 211, cf 221, dj 321.
reymont: apple 131, ng 210, cf 213, dj 338.
osdb: apple 173, ng 206, cf 224, dj 352.
ooffice: apple 142, ng 170, cf 186, dj 291.
nci: apple 119, ng 171, cf 191, dj 282.
mr: apple 164, ng 187, cf 208, dj 329.
mozilla: apple 153, ng 187, cf 199, dj 307.
dickens: apple 148, ng 219, cf 238, dj 344.
Decompression speed comparison (higher is better)

My version is also faster than zlib-cloudflare on x86, at ~1.3x the speed of zlib-cloudflare using clang (~1.7x the speed of Apple’s system zlib), and also ~1.3x using gcc (~1.9x the speed of Ubuntu 20.04’s system zlib) [update: an earlier version incorrectly listed 1.46x and 2.3x, which were the gcc numbers on M1]. However, zlib-dougallj is still 0.89x the speed of Intel’s ISA-L on x86.

If you have a choice, please use a modern alternative like zstd instead of zlib. If you need zlib, consider optimised zlib forks, like zlib-ng or zlib-cloudflare (or libdeflate if you don’t need zlib API compatibility or streaming, or ISA-L if you don’t need zlib API compatibility and you’re on x86). If you maintain a zlib fork or library, please consider incorporating some of my changes. (If you want to use my code, you are free to do so under the zlib license, but it is under-tested, and the bugs tend to be vulnerabilities.)

This speedup is the product of a number of changes, which are discussed in the remainder of the post. The changes interact with each other in very complex ways, so individual speedup measurements shouldn’t be taken seriously – making the same changes in a different order would completely change many numbers.

Bigger root tables

The decoder uses a table to look up bit-sequences and map them to their decoded values. The longest Huffman code allowed by the DEFLATE specification is 15-bits, which would require a 128kb table, so to fit tables comfortably in cache, zlib splits this into a root table and a number of (possibly nested) subtables.

Every time the decoder reads a symbol that isn’t in the root table, we’ll have a branch mispredict, followed by extra latency for the subtable lookup, making root-table-misses very expensive. One of the simplest speedups was increasing root table size for a ~7% decompression speed improvement.

(The table size is still limited to the maximum code length used, meaning this typically has zero performance impact on tiny files, where table-building time dominates, since they’re unlikely to use longer codes.)

Adler32 with UDOT

The zlib wrapper adds an adler32 checksum to the DEFLATE data, which is verified on decoding.

There’s already a NEON SIMD adler32 implementation in zlib-cloudflare, but, not unlike my CRC32 post, the existing implementation doesn’t take full advantage of the unusually wide M1. I added a new implementation which is 2x wider, takes advantage of the amazing UDOT instruction (part of the optional dotprod extension in ARMv8.2, mandatory in v8.4), and uses 64-bit scalar values to allow more iterations of the inner loop before reducing. I think this computes checksums at ~1.5x the speed of the old version (~60GB/s), but the decompression speedup is only about 1.01x, since it was already very fast.

I was very happy to see UDOT-based adler32 added to libdeflate, with impressive performance numbers on non-Apple CPUs too (e.g. ~1.7x speed on Cortex-A55).

Reading Bits

Most of the speedup just comes from reading and applying Fabian Giesen’s posts on Huffman decoding:

The goal was to use roughly the following variant-4-style loop to decode and unconditionally refill with eight-cycle latency on Apple M1:

; do {
;  bitbuf |= read64LE(bitptr) << bitcount 
ldr   x11, [x0]
lsl   x11, x11, x9              ; 1c
orr   x10, x11, x10             ; 1c

;  bitptr += 7
add   x0, x0, #0x7             

;  bitptr -= ((bitcount >> 3) & 7)
ubfx  x11, x9, #3, #3
sub   x0, x0, x11

;  bitcount |= 56
orr   x9, x9, #0x38

;  value = table[bitbuf & 0x1FF]
and   x11, x10, #0x1ff          ; 1c
ldr   w11, [x4, x11, lsl #2]    ; 4c

;  bitcount -= value
sub   x9, x9, x11               ; 1c

;  bitbuf >>= (value & 63)
lsr   x10, x10, x11

;  *output++ = value >> 8
lsr   x11, x11, #8
strb  w11, [x1], #1

; } while (--n);
subs  x8, x8, #0x1
b.ne  x8, start

When we don’t need to refill, we can just do a decode with six-cycle latency:

;  value = table[bitbuf & 0x1FF]
and   x11, x10, #0x1ff          ; 1c
ldr   w11, [x4, x11, lsl #2]    ; 4c

;  bitcount -= value
sub   x9, x9, x11  

;  bitbuf >>= (value & 63)
lsr   x10, x10, x11             ; 1c

;  *output++ = value >> 8
lsr   x11, x11, #8
strb  w11, [x1], #1

Most of the commits are trying to encourage the compiler to generate this code without breaking anything. Simply forcing the compiler to use a 32-bit, shifted load was a ~9% speedup, and moving the shift amount to the least-significant bits was another ~3.5%. (Ignoring the high bits of bitcount was accidentally merged into this commit, but had another measurable speedup. Git is counterintuitive. Unconditional refill was combined with the upcoming fastpath optimisation here.)

As covered in Fabian’s linked blog posts, we’re mostly concerned with latency. The M1 can run 8 instructions per cycle, meaning that we have time to run 64 instructions during that 8-cycle latency chain – a limit we’re unlikely to hit. So the focus is on reducing bit-reading latency and reducing branch mispredicts.

[Update: in a followup post I removed the 2-cycle refill latency from this loop.]

Reading Extra Bits

When reading either the length or distance of an LZ77 pair, the decoder would first read a Huffman code, and then optionally read a number of extra bits:

int entry = table[bits & 0x1FF]; // 5c
bits >>= (entry & 63); // 1c
int extra_bits = entry >> 8;
int extra = bits & ((1 << extra_bits) - 1);
bits >>= extra_bits; // 1c

This second shift adds a cycle of latency. I changed this to combine the two shifts, then look back at the old value to extract the extra bits, saving a cycle of latency:

int old_bits = bits;
int entry = table[bits & 0x1FF]; // 5c
bits >>= (entry & 63); // 1c, includes extra
int non_extra_bits = (entry >> 8);
old_bits &= ~(-1 << (entry & 63)); // clear the high bits
int extra = old_bits >> non_extra_bits;

This required significant changes to the table format, making it one of the more complicated and error prone changes. Speedup ~5.5%.

Fast-Path for Literals

Since we can have 6-cycle latency (rather than 8-cycle latency) if we decode without refilling, I added a fast-path to the top of the loop that can decode two literals if they’re in the root table, before proceeding as usual. This is cheap because we already have to branch on whether or not a value is a literal.

This can take the first 20-bits of our 56-bit buffer. Unfortunately, the worst-case for a length+distance pair with extra bits is 48-bits, which we may no longer have, so this requires a conditional refill if we don’t have enough bits while reading a distance code. However, this worst-case is exceedingly rare – the whole point of compression is to use short codes in the common cases – and so the “needs refill” branch predicts essentially perfectly, and this has a very low cost. Speedup ~4%.

Overlapping Mispredict Latency

There’s a fancy “chunkcopy” optimisation, which is responsible for some of zlib-cloudflare’s performance over zlib-madler (by copying LZ77 references 16-bytes at a time), but this can have expensive branch mispredicts. There’s a benefit to performing the refill and Huffman table lookup for the next iteration before the chunkcopy, allowing the load latency to overlap with any mispredict latency. Speedup ~6%.

Table-building tweaks

Table-building isn’t terribly expensive nor terribly frequent, but does involve counting the number of symbols of each length (1-15), which can be problematic due to the latency of reading-back recently written memory. This might be a bit worse on the M1 than on x86, with no zero-latency forwarding (but no extreme penalties). I instead used NEON intrinsics to keep the sum in two 128-bit registers with lower latency for a ~0.5% improvement in decompression time. I also use clang’s bitreverse intrinsic, and split the table-filling loop to have simpler code for filling the (now larger) root table, and only run the more complex code for the (now smaller and less common) subtables, for another ~0.5%. These changes give a total of a ~1% speedup.

(I was trying anything I could think of to try to get to 1.5x, and this was what finally did it… and then I realised my geometric mean calculation was wrong, and I was already past 1.5x without this change. But it is nice to have.)

Final notes

There’s been a lot of excellent (and inspirational) prior work on zlib performance: Adenilson Cavalcanti has been optimizing zlib decompression on ARM in Chromium, Vlad Krasnov significantly improved compression performance, Janakarajan Natarajan and Volker Simonis ported the Chromium optimisations to cloudflare-zlib, and Nigel Tao wrote The Fastest, Safest PNG Decoder in the World (a great read if you missed it).

Eric Bigger’s libdeflate also deserves more mention in this post, and should have been included in the graphs – I was looking mainly at zlib-compatible projects, but it is significantly faster than zlib-cloudflare. zlib-dougallj currently outperforms libdeflate version 1.13 on M1 at ~1.17x speed, but I didn’t test other platforms.

My changes seem to be a speedup regardless of file, which I hadn’t imagined was possible. There’s still a fair bit of room for further work here – I didn’t even look at the x86 disassembly, but hopefully this can save some time and energy. The code is on Github, under the zlib license, but has not been thoroughly tested. I’m @dougallj on Twitter.

Graph of performance on Canterbury corpus, showing slightly smaller, but still nearly 1.5x speedups.
Graph of performance on Calgary corpus, showing slightly larger, but still around 1.5x speedups.

Update: New graph including more alternatives. Note that libcompression won’t be checking adler32. I should have used zlib-madler as a baseline: 4.85x

Updates in the twitter thread, including ISA-L, and retimed libdeflate, which was updated to incorporate some of these changes after this post. I also uploaded the corpus and the testing procedure.

Parallelising Huffman decoding and x86 disassembly by synchronising non-self-synchronising prefix codes

Variable length non-self-synchronising prefix codes (like x86 instructions and Huffman codes) are hard to decode in parallel, as each word must be decoded to figure out its length, before the next can be found and decoded. I came up with a practical method for finding a synchronisation point in the middle of a run of such prefix-codes (provided the code has a maximum length), allowing a stream to be split into parts that can be processed in parallel.

The fact that x86 tends to eventually synchronise is a fairly common observation (e.g. when disassembling incorrect addresses in a debugger). For example, consider the following instructions:

488B4808          mov rcx,[rax+0x8]
488B4918          mov rcx,[rcx+0x18]
486310            movsxd rdx,dword [rax]
488D3492          lea rsi,[rdx+rdx*4]
4883C018          add rax,byte +0x18
837CF12000        cmp dword [rcx+rsi*8+0x20],byte +0x0

If we skip two bytes, we first see some incorrect results, but we quickly get back to real instructions:

4808488B          o64 or [rax-0x75],cl
49184863          o64 sbb [r8+0x63],cl
10488D            adc [rax-0x73],cl
3492              xor al,0x92
4883C018          add rax,byte +0x18
837CF12000        cmp dword [rcx+rsi*8+0x20],byte +0x0

I like to think of decoding at an invalid offset as giving you an arbitrary result with an arbitrary length. Each time there’s a chance you’ll land on a valid offset, but if you don’t you’ll get another arbitrary result and get another chance.

In real-world code this does work, and this is the premise of the method. However, in general, this is not guaranteed. For example, the x86 instruction add [rax], al is encoded as 00 00, so if we had a stream of zeroes, and we started in the wrong place, we’d never get back to the right instructions.

Method

I assume that the prefix code has a maximum length, n, which is 15 for x86 instructions. The idea is to start decoding at n sequential offsets. One of these decoders is guaranteed to be decoding at a valid offset (as invalid offsets are those that are within real words, so there can be at most n-1 contiguous invalid offsets). Each of these decoders will visit a different sequence of offsets in the message, but they’ll hopefully all converge with the valid decoder. The synchronisation point is the smallest offset that every decoder can reach.

With a few tricks, the implementation ends up looking quite different. To avoid reading more of the message than necessary, we repeatedly find the decoder with the lowest offset and advance it. To avoid redundant decoding, if it lands on the same offset as another decoder, we can discard it, and we can just repeat this process until there’s only one decoder remaining. These decoders will always be within n positions of each other, so each can be represented by a single set bit in an n-bit buffer.

This code is simplified – in practice this would need bounds-checking and an iteration limit to bail out on hard-or-impossible-to-synchronise streams to avoid hurting performance.

  size_t find_sync_after(uint8_t *data, size_t offset) {
    assert(2 <= MAX_CODE_LENGTH && MAX_CODE_LENGTH <= 63);
    uint64_t probes = (1ull << MAX_CODE_LENGTH) - 1;
    while (probes != 1) {
      uint64_t size = decode_length(data, offset);
      assert(1 <= size && size <= MAX_CODE_LENGTH);

      offset += 1;
      probes >>= 1;

      probes |= 1ull << (size - 1);

      offset += countTrailingZeros(probes);
      probes >>= countTrailingZeros(probes);
    }
    return offset;
  }

(I also came up with another method to find the first synchronisation point after a given offset p, by decoding backwards through the stream, maintaining a ring-buffer of n target locations (positions after p where a decode starting from the n contiguous locations would land), and continuing until all entries in the ring-buffer converge to the same value, which is the result. This is more complicated and requires more decoding, but I thought it deserved a mention as it’s interesting to think about, and it could be a drop-in replacement for an operation based on self-synchronising codes. [Update: added code for this to the end of the post])

Applications

This can be used to split the executable section of a large x86 binary into parts for multiple threads to work on, guaranteeing that each will see the same instructions as a single thread performing linear disassembly.

Huffman coding is generally latency bound, and decoding multiple streams at once makes it possible to take advantage of instruction-level parallelism (see Fabian Giesen’s Reading bits in far too many ways series), but typically this is only practical in formats designed to have multiple independent streams. From Fabian’s work, it’s clear that this method could be applied to extract parallelism from large blocks of Huffman-coded data, especially where the size is known ahead of time. I have experimented with using this technique for the Huffman-based DEFLATE (zlib/gzip) decoding, which unfortunately does not satisfy these criteria.

A DEFLATE stream typically consists of multiple different blocks, each using a different Huffman encoding, where the block size isn’t specified ahead of time. However, if we can guess an offset around half-way through the block, and find a synchronisation point, we can exploit instruction-level parallelism to decode both halves of the block at the same time.

This introduces a ton of complexity: the second half is decoded into temporary memory, and only after the first half completes can it be copied into place and LZ77 references can be resolved. To make things worse either the first or second half may finish first, and the second half may have overshot the end of the current block, and may need to be discarded, requiring careful error handling. You’d also need some heuristic to guess how big the blocks are, perhaps assuming block sizes tend to be consistent (though currently I just jump to a hardcoded offset).

My experiments are preliminary and extremely dubious, using a single file and hand-tuned constants, but showed a roughly 25% speedup on Apple M1. I’d like to say that this shows that latency barriers in decoding existing file formats will be able to be broken in the future, but this result is reliant on the nature of the input data, and much more thorough measurement is needed to see if such a method can improve performance in general, let alone be beneficial enough to justify the complexity.

Update (2022-08-01): Backwards Disassembly

Duncan Ogilvie pointed out that this is similar to what’s used for backwards disassembly in debuggers, for which my method to find the first synchronisation point starting from a given offset might be better suited. The caveat with these methods on x86 disassembly is that we assume that linear disassembly gives the desired result – this isn’t necessarily the case if there’s data mixed in with code, or overlapping instructions. So the real benefit of this method over others is consistency; we might not get a good answer, and we can’t guarantee we’ll get any answer without scanning to the start of the code segment, but we’ll always get the same answer.

In particular, using this method can lead to cases where you have to find a way to gracefully display to the user that the current instruction-pointer is inside a different instruction in the linear-disassembly, which other heuristics may avoid.

Either way, here’s code for the slower method to find the first synchronisation point starting from a given offset by scanning backwards:

  // this can be rounded up to a power of two so (x % RING_SIZE) is cheaper
  #define RING_SIZE MAX_CODE_LENGTH

  size_t find_first_sync_at(uint8_t *data, size_t offset, size_t data_size) {
    // if we're at the start, we're synchronised
    if (offset == 0)
      return 0;

    assert(RING_SIZE >= MAX_CODE_LENGTH);
    uint8_t targets[RING_SIZE];

    size_t p = offset - 1;
    int match_length = 0;
    size_t last_target = MAX_CODE_LENGTH; // invalid value
    for (size_t i = 0; ; i++) {
      size_t l = decode_length(data, p, data_size);
      assert(1 <= l && l <= MAX_CODE_LENGTH);

      size_t target = (l > i) ? p + l - offset : targets[(i - l) % RING_SIZE];
      assert(0 <= target && target < MAX_CODE_LENGTH);

      targets[i % RING_SIZE] = target;

      match_length = (last_target == target) ? match_length + 1 : 0;
      last_target = target;

      if (match_length == MAX_CODE_LENGTH - 1 || p == 0)
        return offset + target;

      p--;
    }
  }

Faster CRC32 on the Apple M1

CRC32 is a checksum first proposed in 1961, and now used in a wide variety of performance sensitive contexts, from file formats (zip, png, gzip) to filesystems (ext4, btrfs) and protocols (like ethernet and SATA). So, naturally, a lot of effort has gone into optimising it over the years. However, I discovered a simple update to a widely used technique that makes it possible to run twice as fast as existing solutions on the Apple M1.

Searching for the state-of-the-art, I found a lot of outdated posts, which is unsurprising for a sixty year old problem. Eventually I found a MySQL blog post from November 2021 that presents the following graph, including M1 figures, and gives us some idea that 30GB/s is considered fast:

In fact, in my own testing of the zlib crc32 function, I saw that it performs at around 30GB/s on the M1, so a little better than the graph, which is promising. Possibly that version has been optimised by Apple?

I wanted to try to implement my own version. So, I started at the obvious place, with a special ARM64 instruction designed for calculating CRC32 checksums: CRC32X. This can produce a checksum of 8-bytes, with a latency of 3 cycles. So, theoretically, using this instruction, we could get 3.2GHz / 3 * 8B = 8.5GB/s. On the other hand, CRC32X has a throughput of one per cycle, so supposing we can avoid being latency bound (e.g. by calculating bits of the CRC in chunks, and then combining them) we could get 3.2GHz / 1 * 8B = 25.6GB/s. That’s maybe a little better than the numbers in the MySQL chart, but this is a theoretical best case, not accounting for the overhead of combining the results. (And, as you may have read or guessed, this is the method used in the MySQL post.)

So, can we do better than CRC32X? The M1 can run eight instructions per cycle, and our best idea so far only runs at one instruction per cycle, so maybe we can. Besides, I already tested zlib, and it already performs at 30GB/s, so I know there’s a better way.

The better way was published by Intel in the 2009 paper Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction. This algorithm has been widely implemented, and has been ported to use equivalent ARM64 instructions, PMULL and PMULL2, e.g. in Chromium (coincidentally committed only hours ago at the time of this writing).

I won’t delve into the maths – I don’t even properly understand it – but the main loop has four independent latency chains that look something like this:

x5 = (uint64x2_t) pmull_lo(x1, x0);
y7 = vld1q_u64((const uint64_t *)(buf));
x1 = (uint64x2_t) pmull_hi(x1, x0);
x1 = veorq_u64(x1, x5);
x1 = veorq_u64(x1, y5);

At a glance, I’d say the latency chain was PMULL2 (3) + EOR (2) + EOR (2) = 7 cycles. However, the M1 can fuse PMULL / PMULL2 instructions with subsequent EOR instructions, giving a single uop with three cycle latency. So we can reshuffle this to [PMULL + EOR] (3) + [PMULL2 + EOR] (3) = 6 cycles. (This is ideal for maximising throughput, as it minimises fused uop count, but if you want to minimise latency you can take the PMULL2 off the critical path, giving [PMULL + EOR] (3) + EOR (2) = 5 cycles.)

So we know each chain will have a latency of 6 cycles and have 2 (fused) uops. These uops have a throughput of 4 per cycle, so we can calculate how many independent chains we could benefit from. Over 6 cycles, we can run 6*4 = 24 uops. Since each chain needs 2 uops, we should benefit from having as many as 24/2=12 independent chains – three times more than the 2009 paper used, but also three times more than the more recent implementations I’ve seen.

To be wildly optimistic, if SIMD throughput were the bottleneck, this could run at 0.5 cycles per 16-byte register. 3.2GHz / 0.5 * 16B = 102GB/s. However, that SIMD throughput requires we sustain the maximum of eight unfused instructions per cycle, which would leave no time to load values from memory. Since we’ll need 1 load uop for every four unfused uops (for a total of five out of the eight possible unfused uops per cycle), a more realistic estimate of the frontend limit is 3.2GHz / (5/8) * 16B = 82GB/s.

(By contrast, if we only process 4*16B = 64B per iteration, matching the paper, and have a 6 cycle critical path, we could at most reach 3.2GHz / 6 * 64B = 34GB/s. I believe this is roughly what the zlib function was doing.)

Implementing this is mostly fairly straightforward – stepping in 192 byte increments and copying code to add more latency chains, but it does requires computing new values for k1 and k2, which I did by calling the private zlib function x2nmodp:

uint64_t k1 = (uint64_t)x2nmodp(12*128+32, 0) << 1; // 0x1821d8bc0
uint64_t k2 = (uint64_t)x2nmodp(12*128-32, 0) << 1; // 0x12e968ac4

The resulting code runs at about 70GB/s on the M1, reaching up to 75GB/s if the assembly is adjusted to always have valid fusion pairs. There’s probably still some room for improvement, but I’m pretty happy with that.

My test code is available, although it is not suitable for real-world use as-is.

Update (2022-06-06): This post glosses over the distinction between the two widely used variants of 32-bit CRC (known as CRC32 and CRC32C), because it’s inconsequential – they differ in only the polynomial, so you can use this technique for either with the same performance. Aarch64 also provides both CRC32CX and CRC32X instructions, so you can also use the hardware accelerated techniques for either (although not arbitrary polynomials), again with the same performance.

There are a couple of caveats I didn’t discuss. Firstly, as far as I know, the PMULL + EOR fusion that allows us to run at almost eight SIMD instructions per cycle is unique to Apple designed CPUs, so this is unlikely to be nearly as competitive on other CPUs. The other big risk with this method is that it uses more operations per 16-byte-chunk than the CRC32X instruction technique (3 for CRC32X, vs 5 for this technique). So for small chunks of data, where the CPU can do other work at the same time as the checksum, this is likely to hurt performance, and only on larger chunks does the speed difference really matter. My profiling is all done with 128KB of data or more.

I wrote a slightly cleaner/faster version using inline assembly and C++ templates to remove magic numbers. This reaches ~76GB/s. I also messed with methods for handling unaligned start/end chunks efficiently without reading out of bounds, but that ended up being a huge mess trying to work around problems in clang’s code generation and handle small sizes efficiently – the rough technique for larger buffers is here.

Update (2022-08-01): For Intel CPUs see Pete Cawley’s Faster CRC32-C on x86 – in theory the same idea of mixing both CRC and PMULL instructions should allow even better performance on M1, by decreasing front-end pressure, though I haven’t tried it.

Update (2022-08-08): And for the maths behind this method, see Pete Cawley’s An alternative exposition of crc32_4k_pclmulqdq.

Converting integers to fixed-width strings faster with Neon SIMD on the Apple M1

I was inspired by Daniel Lemire’s blog post, Converting integers to fix-digit representations quickly (and the follow up Converting integers to decimal strings faster with AVX-512) to try solving the same problem using the Neon SIMD instructions on the Apple M1, with impressive performance results, which I attribute mostly to the hardware. (I previously measured and documented the performance of most of these instructions.)

The problem posed is to convert any integer from 0 to 9999999999999999 to a 16-digit string, with leading zeroes if required. On its own, this isn’t terribly useful, but it’s a precise problem to study (algorithms with variable length outputs are much more likely to have their performance vary with their input data), and it’s a useful building-block for more general integer-to-string implementations.

Lemire’s first post also presents this table, with a conspicuous gap for SIMD on M1, which I have filled. For more details on the other methods, see the aforementioned post:

functionApple M1, LLVM 12AMD Zen 2, GCC 10
linear14 ns25 ns
backward linear7.7 ns18 ns
tree6.9 ns15 ns
Khuong3.3 ns8.0 ns
small table3.7 ns7.1 ns
SIMD0.88 ns4.8 ns
big table1.5 ns2.9 ns

I started out by writing a Neon implementation of Paul Khuong and John Wittrock’s excellent SWAR (SIMD-within-a-register) method, described in the blog post How to print integers really fast, and then I iterated on the method to gradually make it more Neon-friendly and work around compiler issues.

Neon implementation (text version on gist)

There are some fun tricks, but it’s mostly just the same old reciprocal division, adjusted to take advantage of the SQDMULH instruction. The existing SWAR implementation does something like the following a few times:

  uint64_t bot = merged - 100ULL * top;
  uint64_t hundreds = (bot << 16) + top;

By reversing the packing order (i.e. shifting top left instead of bot), I get:

  uint64_t hundreds = merged - 100ULL * top + (0x10000 * top);

Which folds to:

  uint64_t hundreds = merged + top * (-100 + 0x10000);

Which is a single multiply-add operation. Unfortunately, some digits are now backwards, but I use a final REV64 instruction to rectify it, which works out nicely. (This optimisation may also be applicable to the SWAR method, if a cheap byte-reverse operation is available, and you’re willing to use it.)

This is able to process a value in about 2.8 cycles on the M1. This takes 0.88 ns per value on the M1 Max in High Power mode (or 1.1 ns in Low Power):

khuong      3.43233
neon        0.875787
backlinear  7.61937
linear     13.2977
tree        6.82174
treetst     6.92956
treest      3.65527
treebt      1.47125

I believe this is a good deal faster than the AVX-512 implementation (Lemire’s post gives 2.5 ns on a 3.30GHz Tiger Lake). How can 128-bit Neon beat 512-bit AVX? Tiger Lake is designed to run at speeds up to 5GHz, so it’s at a disadvantage running at 3.3GHz. However, running at 5GHz would only get to around 1.6 ns, which is still a huge difference.

On the other hand, the M1’s four-wide execution of 128-bit operations is equivalent to Tiger Lake’s one-wide execution of 512-bit operations (in terms of per-cycle throughput), but the M1 is more flexible, since it can be doing four different things each cycle. The Neon solution runs 10 SIMD instructions (for a theoretical maximum throughput of 2.5 cycles on the M1), whereas the AVX-512 solution runs four SIMD multiplies and a shuffle (which I think has a theoretical maximum throughput of 4 to 6 cycles on Tiger Lake?), so this flexibility seems to be the most likely explanation. Intel SIMD instructions are also more likely to contend with scalar instructions for ports, which may be another factor.

A few caveats: this isn’t a controlled experiment – I’m comparing different algorithms on different CPUs, although this is somewhat necessary when comparing very different SIMD ISAs. I couldn’t port the AVX-512 code to Neon, but I didn’t try to port the Neon code back to x86 SIMD, so it’s not impossible there are gains to be had there. The benchmark measures throughput, and ignores latency, which is generally the right thing to do, but still good to keep in mind. The benchmark also largely ignores the cost of loading constant values into registers, as this initialisation is hoisted out of the benchmark loop. Again, this is probably the right thing to do, but may significantly change relative performance if the function isn’t inlined into a loop, or if it is inlined into a loop with significant register pressure. I haven’t bothered to optimise my code for register usage or initialisation speed, but I believe there’d be some easy wins by using the “by element” forms of the SQDMULH and MLA instructions to pack multiple constants into one register.

Code is in this gist – this is a small addition to the benchmarking code from Lemire’s earlier post.

Bit-Twiddling: Optimising AArch64 Logical Immediate Encoding (and Decoding)

I came up with a (seemingly) new method to encode bitmask immediate values on ARM64. This really isn’t worth optimising – clarity and verifiability are more important – but it’s a fun bit-twiddling problem, and the solution I came up with is a bit shorter and ~30% faster than the best existing algorithms I could find (at least in my rough tests – value dependence may complicate things).

Defining Logical Immediates

I found the description of logical immediate values in the specification very confusing, but with the help of Dominik Inführ’s post Encoding of Immediate Values on AArch64, and especially the linked list of all possible encodings, I was able to make sense of it, and come up with my own description that makes this algorithm a bit more evident.

I describe a logical immediate as a pattern of o ones preceded by z more-significant zeroes, repeated to fill a 64-bit integer. o > 0, z > 0, and the size (o + z) is a power of two in [2,64] (meaning the value has at least one zero-bit and one-bit, and repeats to fill the 64-bit integer without any truncation). This part of the pattern is encoded in the fields imms and N. Additionally, the field immr encodes a further right rotate of the repeated pattern. This allows a wide range of useful values to be represented, although notably not an all-zeroes or all-ones pattern.

(The specification describes this right-rotate as rotating the pattern of ones and zeroes before repetition. However, as it’s a repeating pattern, rotating the 64-bit value afterwards is equivalent, and a 64-bit rotate is usually available as a single instruction.)

Encoding

At a high level, my method involves finding the rotation, doing the inverse of the rotation, to normalise the bit pattern, then finding the number of ones and size, and using knowledge of the size to check the pattern is repeating.

I first reject all-zero and all-one bit patterns, then remove the immr rotation. Prior to the rotation, a decoded value will have a one in the least-significant bit, and a zero in the most-significant bit. So I find a one adjacent to a less significant zero, and rotate the value such that that one is the least-significant bit.

There are a few different code sequences that can accomplish this, but I settled on clearing the trailing ones with (x & (x+1)) and counting the trailing zeroes, then rotating the original value right by that amount. Note that this “trailing zeroes count” must be 0 or 64 for an all-zero value, as we may have cleared all the one bits.

This “normalized” value now has a one in the least significant bit, and a zero in the most significant bit, so we can count leading zeroes and trailing ones to determine z and o respectively. We’ve rejected all-zero and all-one values, so we can be sure that 1 <= z < 64 and 1 <= o < 64, and the size 2 <= o + z <= 64.

The final validation is to rotate the original value by the size o + z, and check that the result is still equal to the original value. This validates that we have a repeating pattern filling the 64-bit value (by effectively comparing each repetition of the pattern to the one next to it).

More confusingly, it also ensures that o + z is a power of two. The only minimal patterns that can repeat to fill a 64-bit value must have a length that is a factor of 64 (i.e. it is a power of two in the range [1,64]). By minimal patterns I mean patterns which do not themselves consist of a repeating pattern. For example, 010101 is a non-minimal pattern of a non-power-of-two length that can pass the rotation test. It consists of the minimal pattern 01. All our patterns are minimal, as they contain only one contiguous run of ones separated by at least one zero.

Finally, we encode the values as per the specification, and we’re done.

So, here’s the code:

bool encodeLogicalImmediate64(uint64_t val, int *N, int *immr, int *imms)
{
    if (val == 0 || ~val == 0)
        return false;

    int rotation = countTrailingZeros64(val & (val + 1));
    uint64_t normalized = rotateRight64(val, rotation & 63);

    int zeroes = nonzeroCountLeadingZeros64(normalized);
    int ones = nonzeroCountTrailingZeros64(~normalized);
    int size = zeroes + ones;

    if (rotateRight64(val, size & 63) != val)
        return false;

    *immr = -rotation & (size - 1);
    *imms = (-(size << 1) | (ones - 1)) & 0x3f;
    *N = (size >> 6);

    return true;
}

bool encodeLogicalImmediate32(uint32_t val, int *N, int *immr, int *imms)
{
    uint64_t val64 = ((uint64_t)val << 32) | val;
    return encodeLogicalImmediate64(val64, N, immr, imms);
}

For comparison’s sake, see Linux’s implementation, Dolphin’s implementation, LLVM’s implementation, CoreCLR’s implementation and LuaJIT’s implementation (which performs best of those tested). Oh, and Chakra’s 100kb hash table implementation (which is similar to binutils’ implementation and HotSpot’s implementation, which build tables at runtime and encode with a binary search).

Decoding

Again, for decoders, a lot of implementations seem to closely follow the spec, which is probably wise – you can look at the spec and see that it does the right thing, and it’s hopefully not performance critical. But it’s another fun bit-twiddling problem, so here’s my best effort:

#define DECODE_FAILURE 0 // not a valid logical immediate

uint64_t decodeLogicalImmediate64(int N, int immr, int imms) {
    static const uint64_t mask_lookup[] = {
        0xffffffffffffffff, // size = 64
        0x00000000ffffffff, // size = 32
        0x0000ffff0000ffff, // size = 16
        0x00ff00ff00ff00ff, // size = 8
        0x0f0f0f0f0f0f0f0f, // size = 4
        0x3333333333333333, // size = 2
    };

    int pattern = (N << 6) | (~imms & 0x3f);

    if ((pattern & (pattern - 1)) == 0)
        return DECODE_FAILURE;

    int leading_zeroes = nonzeroCountLeadingZeros32(pattern);
    int ones = (imms + 1) & (0x7fffffff >> leading_zeroes);
    uint64_t mask = mask_lookup[leading_zeroes - 25];
    return rotateRight64(mask ^ (mask << ones), immr);
}

uint32_t decodeLogicalImmediate32(int N, int immr, int imms) {
    if (N) return DECODE_FAILURE;
    return (uint32_t)decodeLogicalImmediate64(N, immr, imms);
}

I couldn’t come up with a way to repeat values a variable number of times without a loop or a lookup, so I went with a lookup. I was originally thinking of using a multiply, but then I thought of the mask ^ (mask << ones) idea, which creates ones 1-bits in all the right places.

I didn’t benchmark this one, though I imagine it’s a bit faster than common techniques involving loops. I’m pretty happy with this, but I still wonder if I’m missing a good trick.

[Update: I was missing a good trick. Zach Wegner notedfor the decoding lookup table equation, you could use 128b unsigned division with (2^128-1) / (2^2^(6-n) + 1), but that’s a tad ridiculous…“. This is fantastic black magic, and I clearly need to learn more about division tricks for bit-twiddling. To my surprise, it’s almost practical. The division is only ~2% slower than the table on Firestorm. On Coffee Lake, it’s ~3x slower than the table, but it’s still ~2x faster than LLVM’s implementation. Code is in this gist – I was way off with my initial idea about using CNEG, it ended up a bit simpler than that.]

Final Notes

The code in this gist has been modified a little from the tested versions for presentation. The full version is available in a gist.

I hereby place the encoder and decoder code in the public domain, as per the terms of the CC0 license, although if you do use it, I’d love to hear about it. My Twitter is @dougallj.

[Update: Pete Cawley offered the following proof that the rotation check ensures that o + z is a power of two:

(x rotate r) == x implies (x rotate gcd(r, bit width of x)) == x. In our case, r == o+z, and we know from construction of o and z that x rotated by less than o+z is not equal to x. Hence the gcd has to equal r, hence r divides bit width.

As someone with a hobbyist interest in maths, the correctness of the first statement wasn’t immediately apparent to me. Although there’s definitely some mathematical arguments I could try to make, and I’m sure others could offer me simple proofs all the way down to a level I’m comfortable with, I’m entirely satisfied that the equivalence of the two expressions is exhaustively verifiable in a relatively foolproof way by mapping out which bits are ensured to be equal to which other bits by (x rotate r) == x for each r from 0 to 63. [Update: Pete added As for why x == x rot r and x == x rot s implies x == x rot gcd(r, s): x == x rot r implies x == x rot (i * r) for any integer i (by induction). Extended gcd gives gcd(r, s) == i * r + j * s. Then x == (x rot (i * r)) rot (j * s) == x rot (i * r + j * s) == x rot gcd(r, s).]

The statement “x rotated by less than o+z is not equal to x” was intuitive to me, but for the sake of completeness: if we rotate right by one, the rotated ones (that we counted) will begin to overlap the zeroes (that we counted), ensuring (x rotate 1) != x. Rotating further, it’s not until the first one has been moved past the last zero (i.e. we have rotated by o + z places), that the counted zeroes and ones will not interfere with each other, and (x rotate r) == x may be true.]

Apple M1: Load and Store Queue Measurements

Out-of-order processors have to keep track of multiple in-flight operations at once, and they use a variety of different buffers and queues to do so. I’ve been trying to characterise and measure some of these buffers in the Apple M1 processor’s Firestorm and Icestorm microarchitectures, by timing different instruction patterns.

I’ve measured the size of the load and store queue, discovered that load and store queue entries are allocated when the ops issue from the scheduler, and released once they are non-speculative and all earlier loads and stores have been released. I may have also accidentally found a trick for manipulating load/store alias prediction. And I figured I should write it up, so other people can reproduce it, and/or find mistakes.

Fighting Loads with Loads

The general idea is to time the execution of two independent long-latency operations (instructions, or chains of dependent instructions), with a number of loads or stores between them. Usually these two long-latency operations can run in parallel, but if there are so many loads/stores that some buffer is completely filled, the machine will stall, and will have to wait until space in the buffer is freed to execute subsequent instructions. This method was described first (and much more clearly) in Henry Wong’s blog post Measuring Reorder Buffer Capacity.

Initially, in measuring the M1, I used cache-missing loads as the “two independent long-latency operations” (as did everyone else, to my knowledge).

My result (which very roughly looked like AnandTech’s results) was that 128 loads or 107 stores could run, without stopping the final long-latency load from executing in parallel, but adding one more would cause a stall. Since the first load, 128 other loads, and the last load are in flight at the same time, I’d call this a load queue size of 130. I still believe this to be correct. On the other hand, the loads don’t require store buffers, so I incorrectly called this a 107 entry store queue.

Schedulers and Dispatch Queues

Although it is not the topic of this post, I also mapped out the structure of the dispatch queues and schedulers:

If an op can enter a scheduler, it can leave the scheduler and execute when all its dependencies are ready. If it can enter a dispatch queue, then it will enter a scheduler when there’s room, and until then, the frontend can continue. But if the scheduler and dispatch queue are full, and we have an op that needs to go to that queue, we stall. This means no more ops can enter any dispatch queues until there is room in that queue.

It is worth noting that the load/store scheduler has 48 entries, with a 10 entry dispatch queue.

(Very briefly: These were measured by filling the scheduler with some number of loads or stores, with addresses depending on the first long-latency chain, then finding two points. First, I found the last point at which an independent load could still fit into the scheduler and run, to find the scheduler size. Then, I found the number of extra load/stores, once the scheduler was full, that were required to block independent floating-point operations. That gives the dispatch queue size. Mixing operations makes it possible to figure out which schedulers/queues are separate and which are not. Should I write this up?)

Fighting Stores with Square-Roots

The next idea was to use a chain of long-latency floating point operations instead. Surprisingly, this produces a result that about 329 loads or stores can run between floating point operations, without forcing the two long-latency chains to run completely in parallel. I assume this means that load and store queue entries are being released, and reused, before the first long-latency chain has retired, and we’re hitting another limit. Mixing loads and stores confirms it’s the same limit. (It’s not the topic of this post, but I’ve explored it a bit and called it the “coalescing retire queue”. I believe it is their ROB implementation.)

So at this point I’m guessing that loads and stores can complete (free their load/store queue entries, but not their retire queue entries) once they are non-speculative. I believe this completion is in-order relative to other loads and stores. The approach I used to test this was to have a single, initial load or store operation, with its address depending on the result of the first long-latency operation.

However, for this writeup, I got the same result by instead adding a branch, dependent on the result of the first long-latency operation. This will ensure the loads and stores cannot become non-speculative, and keep their queue entries. In this test we see we can run 188 loads or 118 stores without forcing the two long-latency chains to run completely in parallel. This was initially pretty confusing, since we believe we only have a 130 entry load buffer. So, where did the extra 58 entries come from?

The load/store scheduler has 48 entries, with a 10 entry dispatch queue. If the load/store scheduler and dispatch queue are full, the integer and floating point units can continue operating. But if we hit one more load/store, the machine stalls, as it has nowhere to put the instruction. This explains the 58 extra entries.

By this logic (subtracting 58 for the size of the scheduler and dispatch queue), the store queue has only 60 entries. So why did we think it had 47 more? Because if the 48 entry scheduler is almost full, but has one free entry, then a load can enter the scheduler, and then issue from the scheduler and execute (in parallel with the other long-latency load), but if the scheduler is completely full, it cannot.

So those are my current numbers, 130 load queue entries and 60 store queue entries. The same logic works for Icestorm, where we see 30 load queue entries and 18 store queue entries (with an 18 entry scheduler and a 6 entry dispatch queue).

An Uncomfortable, Surprising Discovery

Update (2022-07-22): I’ve since had trouble reproducing the results in this section on an M1 Max. I suspect this is an issue with hardware or software revisions, but take this section with a grain of salt.

In writing this up for this post, I tried to reproduce the “107 entry store queue” result, but the result I got was 60 entries. This is both the best answer I could hope for (it’s the number I currently think is correct), and the worst possible outcome (I’m writing this up, so that people can reproduce my work, and I’m failing to reproduce my own results).

So what went wrong? Bisecting a little, I found this was caused by refactoring to use X29 as the base address for the second long-latency load (with a variable offset), and using X29 or SP as the base address for the store (with a constant offset). Changing either or both registers back to X3 (even though the values did not change) gave me 107 again.

Processors try to execute loads and stores out of order, which makes things fast, as long as the data read by a load isn’t later changed by a preceding (in program order) store. This is called a memory order violation. When it happens, a processor typically throws away all its work and starts again, which is very expensive. (Apple provides a performance counter for this. It’s called MEMORY_ORDER_VIOLATION and described as “Incorrect speculation between store and dependent load”.) Because this is so expensive, there are predictors that try to figure out when a load and a store might alias, and run them in order instead. You can read more about how Intel approaches this in Travis Downs’ Memory Disambiguation on Skylake writeup.

X29 is typically used as the stack frame base pointer, and I suspect the memory dependency prediction has a special case for this, and makes the load wait until it knows the store doesn’t alias it. The theory is that if we have 60 speculative stores before the load, we can figure out all their addresses, and that the load can go ahead. But if we have 61, we can’t check the last one, so the load will wait.

The same code running on Icestorm also measures the 18 entry store buffer.

I think this explanation makes sense, but it was a surprising reminder that there’s still a lot of mysteries here, and it’d be good to verify it further. I put the code to reproduce this result in a gist if others want to investigate this.

The Data

So, here’s the data. To get a single number, I find the top of the jump (the first point at which it’s executing completely serially) and subtract one. This is the largest number of instructions that execute a measurable amount faster than completely serially. But however you pick, the results are close enough.

(Note that the dependent B.cc is chained using FCMP, as I think FMOV might interfere with the memory operations.)

Final Notes

The difference between measuring using the resource itself (loads+loads), and measuring using another resource (fsqrts+loads) is very clear in both graphs. The 58 instruction difference implies that when we do not have the resources to execute more loads, we can continue move more loads into the scheduler. So I conclude that the resources are allocated late. Similarly, the ~330 limit we can hit implies that this resource can be freed early.

I do not see this kind of pattern when measuring physical register file sizes (e.g. comparing int+int vs int+float), so I believe they are neither allocated late, nor freed early. But there’s a lot of complexity I do not yet understand.

Another approach to portable Javascript Spectre exploitation

Many people, myself included, have held the belief that Spectre exploits need to know, understand, and manipulate microarchitectural details that are specific to a given processor design. Published Spectre PoCs generally use techniques such as cache analysis, and flushing lines from the cache. Although Stephen Röttger has shown such techniques can be portable between architectures and practical from Javascript using only coarse timing side-channels, such precise techniques are not necessary. This post describes the techniques I used in browser-based Spectre proof-of-concepts that should work on any sufficiently out-of-order processor, and that can be amplified to use arbitrarily coarse timers.

A Safari arbitrary-read Spectre proof-of-concept, using amplification, that works as-is on a 2017 Kaby Lake, a 2019 Coffee Lake, an Apple A9X Twister, Apple A12 Vortex, and an Apple M1 Firestorm

In trying to reproduce the results of the two-year-old V8 paper Spectre is here to stay, I used a very coarse model of caches and speculative execution. The theoretical model is as follows:

  • CPUs have a cache, which must store any recently accessed memory, and is much smaller than main memory. Reading from the cache is faster than reading from main memory.
  • CPUs have memory-level parallelism, i.e. two cache misses can be processed at the same time, if the address of both is known. (On the other hand, if the address of one cache-missing load depends on the result of another cache-missing load, the two will be processed one after the other.)
  • If a memory read misses the cache, execution after the read will continue speculatively, with the branch predictor deciding which way branches that depend on the cache-missing value should go. Memory accesses during speculative execution may access and change which lines are in the cache (enabling the previously mentioned memory-level parallelism, as well as the cache side-channel to get information from a speculative context.)
  • A sufficiently smart branch predictor may predict non-random patterns of branches, and a sufficiently smart pre-fetcher may pre-fetch non-random memory access patterns, but (pseudo) random branch and access patterns are sufficient to make both fail (the vast majority of the time).

This is all deliberately very general, so as to be basically true on all recent, high-performance CPUs.

Finally, I noted that a timing side-channel may experience random delays, the length of which can turn a fast result into a slow result. This is unavoidable, and depends on other system activity, so error detection and correction is required. The probabilistic techniques I use can lead to false positives and false negatives, but as long as these are reasonably improbable, error-correction allows us to turn this into a performance problem rather than a correctness problem.

The two main techniques here were based on, and inspired by, a 2013 blog post by Henry Wong, Measuring Reorder Buffer Capacity.

Pigeonhole Eviction

This idea is as self-evident as the pigeonhole principle, but nonetheless it took me a while to internalise, and to think of applying it to Spectre.

If you have N cache-lines of data, and a cache that can hold only M lines, and N > M, then some of your data (at least N-M lines) must have been evicted from the cache.

For example, consider a CPU with a 256 KB L2 cache. By allocating 64MB of data, there is at most a 0.4% (1/256) chance that a randomly chosen cache-line will be in the L2 cache, and a much smaller chance it’ll be in the (smaller) L1 cache.

This is an adequate eviction “technique” for Spectre. I never evict a chosen line, but instead choose a new location at random that is very likely to have already been evicted. (And at any time a line can be moved into the L1 cache by reading or writing that cache line.)

MLP Amplification

Consider the following loop, which follows a pointer-chain. If p always points to a random cache-line within an uncacheably-large buffer, it will run at approximately one cache-miss per iteration.

 loop {
    p = buffer[p]
 }

Now, supposing I add another cache-miss, where buffer[p+1] also points to a random cache-line:

 loop {
    p = buffer[p]
    sum += buffer[buffer[p+1]]
 }

This will still run at at approximately one cache-miss per iteration, because the two cache-misses don’t depend on each other and can take advantage of memory-level parallelism.

Next, imagine that buffer[p+1] doesn’t point to a random location, but instead points to a location a few steps ahead on the pointer-chain. I’ll call this a prefetching load. Each step to a random location will cost approximately one cache-miss, but on any iteration where the line has been fetched into the cache already, there won’t be a cache miss, so the average will be less than one cache-miss per iteration. These two different iteration speeds, depending on a value, are the basis of the side-channel.

The trick is then to make the prefetching load speculative. A bit that you want to leak from speculative execution can be multiplied by a large number and added to buffer[p+1], and the speed of the loop will now be decided by the value of the bit.

This is done by making the loop mispredict an internal branch (e.g. a bounds check) every few (e.g. five to fifteen) iterations, read some an integer it shouldn’t, extract a bit, and, if that value is zero, preload the upcoming links in the pointer chain. Memory level parallelism is typically much wider than two, so my PoC uses five prefetching loads which can significantly shorten the non-triggering iterations.

With this technique, amplification is arbitrary, and controlled by the number of iterations of the loop. However, the strength of the signal may vary depending on the microarchitecture.

The code for this is complicated by the need to run a random number of branch-predictor-training iterations between every speculative read (as the branch may predict correctly if the number is non-random). This also has the limitation that speculative reads that only work with low-probability have a much weaker signal, which may not be distinguishable from noise (i.e. this technique doesn’t allow the speculative read operation to be timed independently of the cache access operation.)

For timing, the simplest portable technique I found was to leak each bit followed by the inverse of that bit, and compare the two times. This halves the speed of the leak, doesn’t correct for all errors, and isn’t usually strictly necessary, but it does allows leaking at low signal strengths, where dynamic CPU clock frequencies might shift one value to start looking like the other during the run. There’s a lot of room for improvement, but it’s a nice way to skip implementing a calibration phase if you’re lazy.

(Note that amplification is not necessary on browsers with SharedArrayBuffer enabled.)

Spectre Gadgets

My initial efforts used an Array out-of-bounds, storing a cache-missing new length to an array, so as to delay resolution of the bounds-check. The Javascript code worked to read out-of-bounds on different CPUs, and on both Chrome and Safari (although the different Array representations in Chrome and Safari made the out-of-bounds read almost useless in Safari). But this was relatively slow, as the store forwarding of the length would sometimes stall speculative execution (or at least that’s my best guess), lowering the signal strength.

Type-confusion (map check bypass) techniques seem to be the most practical avenue. The variants I used are specific to a browser, but it’s essentially a “choose your own type confusion”, using traditional Javascript exploitation techniques, first reading out-of-bounds, then using that information to set up the data for a fake object that allows an arbitrary read.

Very roughly, I create an object which stores its type field a couple of cache-lines away from a target-field. I create a lot of adjacent objects, of a smaller, different type. Two smaller objects will be interpreted as the larger object’s type and target-field, respectively. I rely on pigeonhole eviction to get a randomly chosen target pair that are uncached, then access the target-field-containing smaller object to ensure it’s in the cache. After this, resolution of the speculative type-check will be delayed, and branch-prediction will assume it passes. However, the speculative read of the target-field will not be delayed. This general idea works on both Chrome and Safari, although the details differ. There are fallible steps here, but speculatively reading unmapped memory won’t crash, error-correction can handle rare flipped bits, and amplification automatically takes an average of a large number of attempts.

Last I checked, Firefox tries not to speculate past type-checks (on JS objects), as well as bounds-checks on arrays, but does not mitigate against variant 4 attacks. I was initially rather sceptical of the practicality of variant 4 attacks described in the V8 paper. Specifically, it’s easy to get a mov [rax], rbx ; mov rcx, [rdx] to use a stale value from memory (where rax == rdx, but the value of rdx is available first), but it’s not immediately clear how this can be exploited. It’s much harder to get the more common mov [rsp+0x8], rbx ; mov rcx, [rsp+0x8] to use stale memory. (This, for example, is a pattern Firefox can generate with one instruction in-between, when spilling a value to the stack, which would be rather exploitable). However, in a standalone harness I found that a Coffee Lake processor could do this with >30% reliability depending on the preceding instructions. This appeared to be under a complex set of microarchitectural conditions, which aren’t particularly hard to hit occasionally, but are very tricky to get the processor to repeatably hit with high probability. I agree with their conclusion that this is a real threat, and essentially unable to be mitigated, although I got distracted trying to understand the microarchitectural state, and didn’t try to write a proof-of-concept.

Coding for Side-Channels

Spectre allows us to get data by asking short, stateless questions, and getting a yes-or-no answer back over a possibly-unreliable channel. This is a similar situation to other blind injection vulnerabilities (such as SQLi). But what are the best questions to ask? This is an interesting mathematical problem, closely related to coding theory.

Some coding theory solutions can be applied directly. For example, dynamic Huffman coding can be used to compress the data (and get a result with fewer questions), which I’ve done for blind SQL injection, and should apply here too. Similarly, Hamming codes and repetition codes can be used for error detection and correction, which I tried for Spectre. However, these solutions are unlikely to be optimal, as the reliable and flexible one-way question channel is an unusual advantage.

A Chrome arbitrary-read Spectre proof-of-concept. This does not need or use amplification, but uses hacky, slow error-correction at about 19% efficiency (and capstone.js to disassemble JIT-compiled code, just for fun). It runs as-is on a 2013 Haswell, a 2017 Kaby Lake, a 2019 Coffee Lake, and an Apple M1 Firestorm.

In my standalone experiments, one of the best (and simplest) reliability techniques I’ve found is “ask and verify” (which I first saw in sqlmap). The idea is that, after leaking a word of the message bit-by-bit, we then ask if the word is equal to what we leaked, and if not, we retry. Less reliable channels benefit from repeating the verification a number of times, and also asking for its inverse (so as to detect “always yes” or “always no” burst noise conditions). They also benefit from smaller chunk sizes. More reliable channels can decrease overhead by asking the verifying question for larger chunks of the message. There’s some fun maths here, but I haven’t taken the time to explore it. Maybe it’s already been written about somewhere?

Final Notes

I don’t know if any of this work is novel. I tried to reproduce the results in the two-year-old V8 paper Spectre is here to stay, where they described proof-of-concepts like this. They left out the proof-of-concepts and specific techniques, presumably because they felt that publishing them would do more harm than good. I filled in the gaps, so maybe my solutions are the same. Maybe they aren’t. It feels kinda weird to use an old paper by the V8 developers to write a V8 exploit that works today, but I guess that’s the post-Spectre world.

Similarly, the recent publication of A Spectre proof-of-concept for a Spectre-proof web by Stephen Röttger and Artur Janc on the Google Security blog would be have been unimaginable a few years ago (as their PoC not only bypasses ASLR but allows the contents of memory to be read from Javascript in the latest Chrome). Their work is much more impressive than mine, with techniques that allow exploitation in much less flexible situations, and are also quite portable (given the prevalence of L1 Tree-PLRU caches). I hope to experiment with some of their techniques in the future.

They state “we don’t believe this particular PoC can be re-used for nefarious purposes without significant modifications”. I don’t plan to publish my proof-of-concepts, but I hope the result that arbitrarily-amplifiable Spectre PoCs needn’t be specific to a cache-design, microarchitecture, architecture, or technically even browser is useful to understanding this uniquely persistent vulnerability.

Bitwise conversion of doubles using only floating-point multiplication and addition

(Unfortunately, this is best viewed on non-mobile. I’ll get off WordPress soon.)

In the words of Tom Lehrer, “this is completely pointless, but may prove useful to some of you some day, perhaps in a somewhat bizarre set of circumstances.”

The problem is as follows: suppose you’re working in a programming environment that provides only an IEEE-754 double-precision floating point (“double”) type, and no operations that can access that type’s representation (such as C++ bitwise cast operations, or Javascript’s DataView object). You have a double and you want to convert it to its bitwise representation as two unsigned 32-bit integers (stored as doubles), or vice versa. This problem comes up from time to time, but I was curious about a different question: how restricted can your programming environment be? Could you do it with just floating point multiplication and addition?

Bitwise conversion using floating point operations can be useful in situations like limited interpreted languages, or C++ constexpr contexts. Generally double to int conversion can be done using a binary search, comparing with powers of two to figure out the bits of the exponent. From there the fraction bits can be extracted, either by binary searching more, or using the knowledge of the exponent to scale the fraction bits into the integer range.

But can it be done without bitwise operations, branches, exponentiation, division, or floating point comparisons?

It seemed improbable at first, but I’ve discovered the answer is yes, multiplication and addition are mostly sufficient, although with a few notable caveats. Even without these restrictions different NaN values cannot be distinguished or generated (without bitwise conversion) in most environments, but using only multiplication and addition it is impossible to convert NaN, Infinity or -Infinity into an unsigned 32-bit value. The other problematic value is “negative zero”, which cannot be differentiated from “positive zero” using addition and multiplication. All my code uses subtraction, although it could be removed by substituting a - b with a + (b * -1). And finally, this relies on IEEE-754 operations (in the usual rounding mode, “round to nearest, ties to even”), so it wouldn’t work in environments that use unsafe maths optimisations (the default in shader compilers, or enabled by a flag such as /fp:fast in many other compilers).

So, if you just need a solution, here it is, but otherwise stick around for an explanation:

function double_as_uint32s(double) {
  // Doesn't handle NaN, Infinity or -Infinity. Treats -0 as 0.

  var a = double, b, c, d, e, f, g, h, i, j, k, l, m, n, low, high;

  f=2.2250738585072014e-308+a; j=5e-324; b=j+f; b-=f; m=-5e-324; d=m+b; b=4.4989137945431964e+161; d=b*d; d=b*d; g=d*d;
  d=1.0; g=d-g; h=m+f; f=h-f; f=j+f; f=b*f; f=b*f; f*=f; f=d-f; f*=g; g=-2.2250738585072014e-308+a; h=j+g; h-=g; h=m+h;
  h=b*h; h=b*h; h*=h; h=d-h; c=m+g; c-=g; c=j+c; c=b*c; c=b*c; c*=c; c=d-c; c*=h; k=c*f; c=5.562684646268003e-309*a;
  g=j+c; g-=c; g=m+g; g=b*g; g=b*g; g*=g; g=d-g; h=m+c; h-=c; h=j+h; h=b*h; h=b*h; h*=h; h=d-h; g=h*g; h=a*g; g=d-g;
  c=g*c; g=1024.0*g; f=2.0+g; c+=h; h=7.458340731200207e-155*c; l=1.0000000000000002; g=l*h; g=m+g; e=j+g; e-=g; e=b*e;
  e=b*e; c=e*c; e=d-e; g=e*h; c=g+c; e=512.0*e; g=8.636168555094445e-78*c; e+=f; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f;
  f=b*f; c=f*c; f=d-f; g=f*g; f=256.0*f; c=g+c; e=f+e; f=2.938735877055719e-39*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g;
  g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=128.0*g; g=5.421010862427522e-20*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f;
  f=b*f; c=f*c; f=d-f; g=f*g; f=64.0*f; c=g+c; e=f+e; i=2.3283064365386963e-10; f=i*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g;
  g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=32.0*g; g=1.52587890625e-05*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f; f=b*f;
  c=f*c; f=d-f; g=f*g; f=16.0*f; c=g+c; e=f+e; f=0.00390625*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g;
  f=g*f; c=f+c; f=8.0*g; g=0.0625*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f; f=b*f; c=f*c; f=d-f; g=f*g; f=4.0*f;
  c=g+c; e=f+e; f=0.25*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=g+g; e=f+e; n=0.5;
  f=n*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; e=g+e; f=d-k; g=j+a; g-=a; g=m+g; g=b*g;
  g=b*g; g*=g; g=d-g; h=m+a; a=h-a; a=j+a; a=b*a; a=b*a; a*=a; a=d-a; a*=g; g=f*a; a=d-a; a=e*a; a+=g; e=l*c; e=m+e;
  g=j+e; e=g-e; e=b*e; e=b*e; g=n*c; c=e*c; e=d-e; e*=g; c=e+c; e=4.450147717014403e-308+c; g=j+e; g-=e; g=m+g; g=b*g;
  g=b*g; g*=g; g=d-g; h=m+e; e=h-e; e=j+e; e=b*e; e=b*e; e*=e; e=d-e; e*=g; g=e+e; d-=g; c=d*c; c=b*c; b*=c;
  c=-4503599627370496.0*f; c+=b; b=i*c; b=-0.4999999998835847+b; b=4503599627370497.0+b; d=-4503599627370497.0+b;
  b=2147483648.0*e; a=1048576.0*a; a=b+a; b=d+a; a=-4294967296.0*d; a+=c; low=a; high=b;

  return [low, high];
}

function uint32s_as_double(low, high) {
  var a = low, b = high, c, d, e, f, g, h, i, j, k, l, m;

  b=9.5367431640625e-07*b; f=-0.4999999998835847; c=f+b; g=4503599627370497.0; c=g+c; e=-4503599627370497.0; c=e+c;
  d=b-c; c=0.00048828125*c; b=f+c; b=g+b; k=e+b; l=c-k; j=2.2250738585072014e-308; c=j+l; c-=l; i=4.49423283715579e+307;
  b=i*c; c=1.0; b=c-b; a=2.220446049250313e-16*a; h=-0.00048828125+l; a=d+a; d=b*h; d+=d; h=f+d; h=g+h; h=e+h; d-=h;
  b+=a; b=j*b; m=1.3407807929942597e+154; h=m*h; h=c+h; b=h*b; b*=h; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=m*h; h=c+h;
  b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=1.157920892373162e+77*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h;
  h=3.402823669209385e+38*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=1.8446744073709552e+19*h; h=c+h; b=h*b;
  d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=4294967295.0*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=65535.0*h; h=c+h;
  b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=255.0*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=15.0*h; h=c+h;
  b=h*b; d+=d; f+=d; f=g+f; e+=f; d-=e; e=3.0*e; e=c+e; b=e*b; d+=d; d=c+d; b=d*b; d=-0.99951171875+l; e=j+d; d=e-d;
  d=i*d; e=j+a; a=e-a; a=i*a; a=c-a; a=d*a; a=m*a; a=m*a; a-=a; a=b+a; b=k+k; b=c-b; a*=b;

  return a;
}

(I’m mostly joking, but it would be pretty funny to find that code buried in a library someday, and it should be pretty easy to port to just about any language.)

Background: IEEE-754 Doubles

This aims to be a concise explanation of everything you need to know about doubles to understand the rest of the article. Skip it if you know it already.

A double is a 64-bit value. Going from most-significant-bit to least-significant-bit, it is comprised of a 1-bit sign, an 11-bit exponent and a 52-bit fraction. These bits are interpreted as either a special value, or a numerical value as described in the following pseudocode. The operations and values in the pseudocode have infinite precision, and ** is the exponentiation operation.

if (sign == 1)
  s = -1;
else
  s = 1;

if (exponent == 0x7FF) {
  // Maximum exponent means a special value
  if (fraction == 0)
    return NaN;       // Not a Number
  else if (sign == 1)
    return -Infinity;
  else
    return Infinity;
} else if (exponent == 0) {
  // Zero exponent means a subnormal value.
  return s * (0.0 + fraction * (2 ** -52)) * (2 ** -1022);
} else {
  // Everything else is a normal value.
  return s * (1.0 + fraction * (2 ** -52)) * (2 ** (exponent-1023));
}

Normal values have an implicit leading 1, and can be thought of as “1.fraction”. Subnormals do not, so can be thought of as “0.fraction”, but they are otherwise the same as exponent == 1.

This has been carefully designed, and gives a few interesting properties:

  • The implicit leading one ensures that each value has a unique representation (except for 0/-0 and NaN).
  • The subnormals ensure that distance between representable numbers only ever decreases as you get closer to zero, so the difference between two sequential values (also known as a “unit in last place” or ULP) is always exactly representable.
  • For positive numbers, the floating point value increases with its 64-bit integer representation, so they could be compared as integers, or you can find the next representable value by adding 1 to its int64 representation.

Addition and multiplication of doubles is defined as exact, infinitely-precise, mathematical addition and multiplication. If the exact result can be represented by a double, that double is the result, otherwise rounding occurs. IEEE-754 specifies several rounding modes that can be used, but I’ll focus on the most widely used one “round to nearest, ties to even”. This means that the nearest representable value is used, or if the exact result is half way between two representable values, the value with zero in its least significant fraction bit is used. If the infinitely precise result gets too large or too small, it will be rounded to Infinity or -Infinity (see IEEE-754-2008 section 4.3.1 for a formal definition).

Finally, we should consider the special values. If NaN is an input to an addition or multiplication, the result will always be NaN. Multiplication and addition with Infinity or -Infinity will result in other Infinity or -Infinity values, with the exceptions of multiplying Infinity by zero, or subtracting Infinity from Infinity, both of which will result in NaN.

Notation

From this point onward, this is an attempt at something like literate programming, presented in essentially the order I created it, starting with just multiply, add and subtract, then building progressively more powerful functions. The code was written as C++, and has been refactored to simplify the explanation. I do make use of loops and functions, but only where they can be completely unrolled or inlined by the compiler.

I’ve omitted the function double p2(int e), which provides a power of two – everywhere it is used it gets inlined as a constant, but the easiest way to ensure this was to use a lookup table with 2098 values.

The macro CONSTEXPR is defined as follows, mostly to allow adjustments to inlining, or removing the constexpr keyword from everything easily:

#define CONSTEXPR \
  constexpr static inline __attribute__((always_inline))

Throughout this text I’ve used exponent to mean the encoded exponent bits in a double, as opposed to the unbiased/decoded exponent (exponent - 1023). Hopefully that’s not too confusing.

Logic Operations

I started by investigating what you can do with only addition and multiplication. Supposing “true” is 1.0 and “false” is 0.0, I implemented some logic operations:

CONSTEXPR double double_and(double a, double b) {
  return a * b;
}

CONSTEXPR double double_not(double a) {
  return 1 - a;
}

CONSTEXPR double double_or(double a, double b) {
  return a + b - a * b;
}

CONSTEXPR double select(
    double condition, double if_true, double if_false) {
  return condition * if_true + double_not(condition) * if_false;
}

These are mostly presented without further comment, as they can be tested exhaustively. However select is where things get a bit tricky. Because Infinity * 0 = NaN and NaN + anything = NaN, we can never ignore Infinity values and must be meticulous about never performing operations that could create them.

Avoiding Infinities

Given I want to convert an arbitrary floating point number to its bitwise representation, I had to start by figuring out what operations I could do on any floating point number without risking creating an Infinity.

One option here is multiplying by values between 1.0 and -1.0 inclusive as the result will never increase in magnitude. This works in any rounding mode.

We can also add any constant value between p2(969) and -p2(969) exclusive, as this will not round to infinity when added to the positive or negative values of greatest magnitude. However, this only works in round-to-nearest or round-toward-zero modes, as round-toward-positive and round-toward-negative may round to Infinity when adding even the smallest non-zero value.

An Initial Comparison

I figured I would need to construct (x == y) and (x < y) comparisons – something that would give me a boolean 0.0 or 1.0 that I could use with my logic functions. But I couldn’t even come up with a way to compute (x == 0). So I instead started with the question: what boolean value can I create?

Consider floating point addition of the smallest positive value (p2(-1074)) to a number. If exponent (the value of the encoded bits) is zero or one, this value is the ULP (distance between subsequent representable floating point values), so the result will be exact. When exponent is two, the ULP is doubled, so the exact result will land between two representable values, so it will “round to even”, and either round up (adding p2(-1073) instead) or round down (leaving the value unchanged). Finally, if the exponent is four or above, the exact result of the addition will never reach the midpoint between representable values, so rounding to nearest will leave the value unchanged.

That explanation doesn’t completely cover the boundaries between exponent values. Importantly, when adding p2(-1074) to the negative number with exponent two and fraction zero, the result will have exponent one, and therefore is exactly representable (although the same is not true for the corresponding positive number).

So, supposing we compute x + p2(-1074) - x we will get either p2(-1074) * 2 or 0 if there was rounding, or p2(-1074) if the result of the addition is accurate.

This can be turned into a boolean like so:

CONSTEXPR double adding_smallest_is_precise(double x) {
  double add_error = x + p2(-1074) - x;

  // add_error is in {0, p2(-1074), 2 * p2(-1074)}
  add_error -= p2(-1074);

  // add_error is in {-p2(-1074), 0, p2(-1074)}
  // divide by p2(-1074), by multiplying by p2(1074). p2(1074) is
  // out of range, so multiply by its square root twice instead.
  add_error = add_error * p2(1074/2) * p2(1074/2);

  // add_error is in {-1, 0, 1}
  add_error *= add_error;

  // add_error is in {1, 0, 1}
  return double_not(add_error);
}

This function computes -p2(-1021) <= d < p2(-1021), which is enough to start constructing other comparisons.

However, this comparison is frustratingly asymmetric, so we’ll compute -p2(-1021) < d < p2(-1021) as follows. This is equivalent to checking if the exponent is zero or one.

CONSTEXPR double is_exp_0_or_1(double x) {
  double precise_add = adding_smallest_is_precise(x);
  double precise_sub = adding_smallest_is_precise(-x);
  return double_and(precise_add, precise_sub);
}

Equality Comparisons

To start with, it’d be good to compute x == 0. We can now do that by taking the minimum and maximum values that satisfy is_exp_0_or_1 and checking that x + v still satisfies is_exp_0_or_1 for both:

CONSTEXPR double is_zero(double x) {
  double magic = p2(-1021) - p2(-1074);
  return double_and(is_exp_0_or_1(x + magic),
                    is_exp_0_or_1(x - magic));
}

This works, and is Infinity-safe, as the magic number is nowhere near the limit of p2(969). It also gives us a way to implement x == y, by checking is_zero(x - y). However, x - y may be Infinity, so we must first implement a safe subtraction operation for comparisons:

CONSTEXPR double cmp_sub(double x, double y) {
  // return a number with the same sign as x-y (or zero
  // if x-y==0), while avoiding returning infinity.

  double small = double_or(is_exp_around_0_or_1(x),
                           is_exp_around_0_or_1(y));
  double multiplier = (small + 1) * p2(-1);
  return (x * multiplier) - (y * multiplier);
}

If either value has a tiny exponent, then x - y cannot become infinite. However, if both values have an exponent >= 2, multiplying by p2(-1) will be lossless (it just subtracts 1 from the exponent). As such, the result will be zero when x == y, will be positive when x > y and will be negative when x < y. So we can test equality like so:

CONSTEXPR double is_equal(double x, double y) {
  return is_zero(cmp_sub(x, y));
}

Unfortunately, we still don’t have a way to calculate x < 0 (which would give us x < y), but we’ll get back to that later.

Getting the Exponent

If we want to convert a double to its bitwise representation, we’ll need to extract its encoded exponent. So far, we can check if the exponent is zero or one.

We can use that to build a test for if the exponent is zero (i.e. the value is a subnormal), by adding constants that shift values with exponent one outside of the range:

CONSTEXPR double is_exp_0(double x) {
  return double_and(is_exp_0_or_1(x + p2(-1022)),
                    is_exp_0_or_1(x - p2(-1022)));
}

The other thing we want is to multiply by negative powers of two. This will subtract a constant from the exponent (leaving the fraction unchanged), unless the exponent reaches zero, in which case rounding will occur (possibly rounding up to a value with exponent one). This can be used to build tests for if the exponent is less than a given value. For example, is_exp_0_or_1(v * p2(-1024)) will be true if the exponent is less than 1024 + 2.

This can be used to binary search the value of the exponent:

CONSTEXPR double get_encoded_exponent(double v) {
  double tmp = v;
  double e = 0;

  #pragma unroll
  for (int test = 1024; test >= 1; test /= 2) {
    double trial = tmp * p2(-test);
    double too_small = is_exp_0_or_1(trial);

    tmp = select(too_small, tmp, trial);
    e += select(too_small, 0, test);
  }
  
  return select(is_exp_0_or_1(v), double_not(is_exp_0(v)), e + 2);
}

This will check if the encoded exponent is less than 2 + 1024, and if not, it’ll subtract 1024 from the encoded exponent (by multiplying by p2(-1024)), and add 1024.0 to our exponent value. This is repeated with smaller powers of two, until we know that the remaining encoded exponent is 0, 1, or 2, and the e variable will contain the amount subtracted. Finally, it uses the is_exp_0_or_1 and is_exp_0 functions to handle the zero and one cases explicitly.

Complete Comparisons

This is a great step towards bitwise casts, but tmp in get_encoded_exponent is interesting. By the end of the function, we’ve preserved its sign and fraction bits, but its exponent has been converted to only 0, 1, or 2. This makes the challenge of testing x < 0 much simpler.

We can easily define a make_exp_0_or_1 function, that does the same thing, but also halves values that were left with exponent two:

CONSTEXPR double make_exp_0_or_1(double v) {
  double res = v;

  #pragma unroll
  for (int test = 1024; test >= 1; test /= 2) {
    double trial = res * p2(-test);
    res = select(is_exp_0_or_1(trial), res, trial);
  }

  return select(is_exp_0_or_1(res), res, res * p2(-1));
}

Now we can add a constant to shift all non-negative values out of the zero-or-one exponent range, such that only values less than zero pass the is_exp_0_or_1 test.

CONSTEXPR double is_less_than_zero(double v) {
  return is_exp_0(make_exp_0_or_1(v) + p2(-1022));
}

And, using our cmp_sub from earlier, we can compute (x < y):

CONSTEXPR double is_less_than(double a, double b) {
  return is_less_than_zero(cmp_sub(a, b));
}

Floor

The final tool we need before we can put together out bitwise casts is floor. For this, we’ll consider only numbers between zero and p2(52), and we’ll use a trick I’ve seen in the past (e.g. in musl libc’s floor.c). The trick is to add and subtract p2(52). Within the range p2(52) to p2(53), the ULP is exactly 1, so x + p2(52) - p2(52) performs a round-to-nearest-integer operation. From here, we can simply check if it rounded up, and subtract 1 if it did:

CONSTEXPR double small_positive_floor(double v) {
  // WARNING: incorrect for negative numbers and some
  // values over p2(52)
  // (but works for zero despite the name)
  double r = v + p2(52) - p2(52);
  return select(is_less_than(v, r), r - 1, r);
}

This lets us extract specific bits from a floating point integer. Specifically, I use the following idiom to split n low bits from an integer x: high_part = floor(x * p2(-n)); low_part = x - high_part * p2(n);

Double to bits

So, how close are we to converting a double to its bits? get_encoded_exponent gives us the exponent bits. is_less_than_zero gives us the sign bit.

For the fraction, make_exp_0_or_1 has given us all the fraction bits, but preserved the sign, and the implicit leading 1 if the number isn’t subnormal.

We can clear the sign bit by multiplying by -1 if the value is negative. We can subtract the implicit leading 1 if the value isn’t subnormal to be left with only the fraction bits, and then scale it up by p2(1047) so that a fraction of 1 is 1.0:

CONSTEXPR double get_fraction(double v) {
  double result = make_exp_0_or_1(v) *
                  select(is_less_than_zero(v), -1, 1);
  result -= select(is_exp_0(v), 0, p2(-1022));
  result = result * p2(1074 / 2) * p2(1074 / 2); 
  return result;
}

This gives us a 1-bit sign value, an 11-bit exponent value, and a 52-bit fraction value (all stored as integers within doubles), so we just need to split that into two 32-bit values.

These traditionally bitwise ops are written using multiplication by powers of two as a constant shift (with floor to truncate the result), addition to set bits (instead of bitwise “or”), and subtraction to clear bits (instead of bitwise “and”):

struct low_high_doubles {
  double low;
  double high;
};

CONSTEXPR struct low_high_doubles constexpr_double_as_ints(double v){
  double sign = is_less_than_zero(v);
  double exponent = get_encoded_exponent(v);
  double fraction = get_fraction(v);

  double high_fraction = small_positive_floor(fraction * p2(-32));
  double high = sign * p2(31) + exponent * p2(20) + high_fraction;
  double low = fraction - high_fraction * p2(32);
  return { low, high };
}

Bits to double

To convert bits to double, we can roughly follow the inverse. This is conceptually a bit simpler, so it’s only explained lightly in the comments:

CONSTEXPR double double_from_sign_exp_fraction(
    double sign, double exponent, double fraction) {
  double exp_is_non_zero = double_not(is_zero(exponent));

  // scale fraction down to exponent 0
  double v = fraction * p2(-1074);

  // add the implicit leading one if needed (so exponent = 1)
  v += select(exp_is_non_zero, p2(-1022), 0);

  // compute how much we need to increment the exponent by
  double e = select(exp_is_non_zero, exponent - 1, 0);

  // shift it so that all but the first bit is after the point
  e *= p2(-10);

  #pragma unroll
  for (int test = 1024; test >= 1; test >>= 1) {
    // cond will be 1 if the relevant bit is set, otherwise 0
    double cond = small_positive_floor(e);

    // clear the current bit and shift the next bit into the
    // ones place
    e = (e - cond) * 2;
    if (test == 1024) {
      // p2(1024) is unrepresentable, so multiply by its
      // square root twice
      v *= select(cond, p2(512), 1.0);
      v *= select(cond, p2(512), 1.0);
    } else {
      v *= select(cond, p2(test), 1.0);
    }
  }

  // generate a NaN value if one is expected.
  double is_nan = double_and(is_equal(exponent, 2047),
                             double_not(is_zero(fraction)));

  // if it's a NaN, "v" will already be Infinity, so multiply by
  // zero to make it NaN, otherwise multiply by one to leave it
  // as-is.
  v *= double_not(is_nan);

  // set the sign bit
  v *= select(sign, -1, 1);

  return v;
}

Finally, we just need to extract the sign, exponent and fraction fields from the high and low unsigned 32-bit integers:

CONSTEXPR double constexpr_ints_as_double(double l, double h) {
  double exp_and_sign = small_positive_floor(h * p2(-20));

  double sign = small_positive_floor(h * p2(-31));
  double exponent = exp_and_sign - sign * p2(11);

  double fraction = (h - exp_and_sign * p2(20)) * p2(32) + l;

  return double_from_sign_exp_fraction(sign, exponent, fraction);
}

The code presented above is true to my initial implementation, but ends up quite bloated, compiling to around 5000 add, subtract or multiply operations (assuming it’s all inlined and unrolled). You can see it on Compiler Explorer or gist.

“Dirty” floor trick

Perhaps that would be a good place to leave it, but I tried to optimise the number of operations it a little. To decrease the size to something comparable to what’s shown in the initial Javascript (around 368 operations), a number of less safe or less clear functions and techniques are used.

The biggest problem is floor, which requires the make_exp_0_or_1 operation every time (binary searching the exponent takes a fair number of instructions). In every situation we use “floor” we know a lot about the range of the value, and the number of bits present after the point. This lets us implement floor without a comparison, by just biasing the input numbers such that round-to-nearest-ties-to-even will round down.

CONSTEXPR double dirty_floor(double v) {
  // for values between 0 and 0x100000 with up to 32 significant bits
  // after the "decimal" point.
  return v - (0.5 - p2(-33)) + (p2(52)+1) - (p2(52)+1);
}

This might be the most complex trick, so to explain a little more: ignoring edge cases we could say that floor(x) == roundToNearestEven(x - 0.5). But the “edge case” here is integers, which will end up exactly between two integers, so round-to-even will round half of all integers down, giving the wrong result.

We can get the right result by subtracting slightly less than 0.5 instead. How much less? Well, it can’t make any other value land on 0.5, so it must be smaller than the smallest distance between possible inputs. But it also can’t get rounded off, so it must be at least the ULP for the biggest possible input.

This is impossible to solve if you have 53 significant bits, but fortunately we don’t. The constant chosen works out exactly for our 52-bit fraction being shifted right by 32, and happens to work everywhere else, as there are both fewer significant bits and no larger values.

More tweaks

Revisiting the initial comparison, a cheaper symmetrical boolean test was found. This computes -p2(-1021) <= d <= p2(-1021) (i.e. the same as is_exp_0_or_1 but including one value on either side).

CONSTEXPR double is_exp_around_0_or_1(double v) {
  double biased = v - p2(-1074);
  return (biased + p2(-1074) - biased) * p2(1074 / 2) * p2(1074 / 2);
}

(This can be analysed case-by-case, but essentially the initial bias both makes it symmetrical, and prevents a subsequent round-to-even from ever rounding away from the biased value, simplifying the conversion to boolean.)

We can go a bit further to try to replace is_exp_0_or_1 by multiplying the input by the smallest double greater than one. Unfortunately, this can generate Infinity when called on arbitrary values, but we can use it on all but the first iteration of our exponent decreasing loops.

CONSTEXPR double unsafe_is_exp_0_or_1(double v) {
  // only works for non-huge numbers
  return is_exp_around_0_or_1(v * (p2(0) + p2(-52)));
}

We can use much coarser comparisons when we know a number is either zero or a long way from zero, as we do when comparing the “exponent” or “fraction” values:

CONSTEXPR double is_integer_zero(double v) {
  return (v + p2(-1022) - v) * p2(1022);
}

CONSTEXPR double is_small_integer_equal(double a, double b) {
  return is_integer_zero(a - b);
}

Despite the names, I can and did use these method on non-integers without worrying, when I knew they were well above roughly p2(-900) (around which we might have to worry about the addition being accurate for a non-zero value).

Finally, there were just a lot of little simplifications that the compiler cannot perform. A lot of duplicate work was removed by computing sign, exponent and fraction at the same time in one big function. Throughout the code, select(cond, x, y) with constant x and y could often be written as (x - y) * cond + y, which simplifies even further if y is zero. And there were plenty of other algebraic simplifications of little note.

You can find my optimised code on Compiler Explorer or gist. (Although this doesn’t quite match the code in this post, it should match the Javascript at the top closely.)

The Javascript was generated by compiling the optimised code with clang++ -O2 -fno-slp-vectorize -march=skylake -std=c++14 -fomit-frame-pointer -S, which generated an assembly file containing a stream of vaddsd, vsubsd and vmulsd instructions, as well as vmovsd instructions to load constants. These instructions were translated into Javascript using a terrible Python script.

Future work

As noted, this was a completely pointless exercise, but it does open up some avenues for further pointless exercises:

  • Can it be generalised to work for float as well (splitting to two 16-bit values)?
  • Can it be extended to other rounding modes? All other rounding modes?
  • Are there simpler or smaller implementations of the various operations used?
  • Could it be turned into a reasonable expression, with no variables, just nested additions and subtractions? Doing so naively gives multi-gigabyte results, but no effort was made to optimise for this.
  • This roughly shows that any function from finite doubles to finite doubles can be implemented. How hard is it to approximate division? How many operations would it take to implement correctly rounded division?

I’d also like to generate a version of the Javascript where all the constants are synthesised from “2.0” and “0.5”, so as to try to be portable to restricted environments with potentially inaccurate floating-point constant parsing.

Further Reading

As I was mostly exploring this for fun, I used very little by way of references, but here are a couple of somewhat related things I quite like:

  • mov is Turing Complete (Stephen Dolan) and the movfuscator (Christopher Domas) (github, talk)
  • Handbook of Floating-Point Arithmetic (Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres)

Anyway, thanks for reading! Let me know if you find any mistakes. You can follow me on Twitter at @dougallj.

Bit-Twiddling: Addition with Unknown Bits

(If you can’t read the code in this post on mobile, try this wrapped version instead, sorry)

Suppose you have two values. You know some bits are zero, some bits are one, other bits could be either. You can add together those two values to get another such partially-known value, determining as precisely as possible which bits can be known and which cannot. This operation is common in optimising compilers and program analysis, and it has some interesting properties that I’ll circle back to, but first here’s my implementation of the addition operation described above:

  struct known_bits {
    unsigned ones;
    unsigned unknowns;
  };

  struct known_bits kb_add(struct known_bits a,
                           struct known_bits b) {
    struct known_bits result;
    unsigned x = a.ones + b.ones;
    result.unknowns = a.unknowns | b.unknowns |
                      (x ^ (x + a.unknowns + b.unknowns));
    result.ones = x & ~result.unknowns;
    return result;
  }

This function compiles to about 12 instructions, the assembly for which fits in a tweet. But let’s look at the theory behind it all.

Representation

A perhaps surprising amount of thought went into that simple-looking structure.

Each partially known value is a ternary number (with digits zero, one and unknown), so for a 32-bit number, there are 332 possible values. This requires more than one 32-bit value to represent, but can be comfortably represented with two or three 32-bit values. Initially, I stored “known ones” and “mask”, where the mask was set for all known bits. After figuring out most of my functions, I realised I could remove a significant number of bitwise nots by storing the unknown bits instead of known bits.

I could also store a third value, “known zeroes”, but this can be calculated if needed by the expression ~(v.ones | v.unknowns), and it seemed error prone to make the representation more redundant.

Theory

But that’s enough about code – let’s get back to some maths. Each value can be thought of as a lossy representation of a set of possible numbers. The size of this set is given by 2popcount(unknowns) (meaning if there are no unknown bits it represents a single value, or if there are two unknown bits it represents four possible values).

The representation is lossy. It would take 232 bits to represent an arbitrary subset of the 32-bit integers, and I’m only storing 64 bits. For example, consider the set {000, 011}. This is represented as 0UU (maximising for precision), which represents the superset {000, 001, 010, 011}.

In some sense, the most fundamental operation is the “union” operation:

   r.ones = a.ones & b.ones;
   r.unknowns = a.unknowns | b.unknowns | (a.ones ^ b.ones);

This combines two sets, such that only bits which are known to be the same are known in the result.

The other fundamental operation is iteration. This is done by iterating over values up to the set size (2popcount(unknowns)), and distributing those bits to the unknown bit locations. This can be written efficiently using a technique described in Fabian Giesen’s Texture tiling and swizzling.

Using these two operations, we can define an algorithm for computing the maximally precise known bits after any operation. As a pseudocode example:

  r = known_value(op(a.ones, b.ones))
  for each possible a as A
    for each possible b as B
      r = union(r, known_value(op(A, B)))

This has an exponential runtime, so it’s entirely impractical, but I quite like it as a mathematical definition of the best possible result, and it can be useful for quickly testing better algorithms.

Some Bitwise Operations

Bitwise operations have relatively simple definitions. A nice example is or:

  r.ones = a.ones | b.ones;
  r.unknowns = (a.unknowns | b.unknowns) & ~r.ones;

Any known ones in either input will be known ones in the output, known zeroes in both inputs will give a known zero output, and anything else is unknown. For example: 00011U | 0U1U1U = 0U111U.

Another example is xor.

  r.unknowns = a.unknowns | b.unknowns;
  r.ones = (a.ones ^ b.ones) & ~r.unknowns;

Any unknowns in either input will be unknown, but any two known values are xored. For example: 00011U ^ 0U1U1U = 0U1U0U.

It’s worth remembering that xor is basically addition without the carry.

Addition Operation

Coming back to addition, there are a few things worth noting. Firstly, our addition is not associative, for example:

  (0001 + 0001) + 000U = 0010 + 000U = 001U
  0001 + (0001 + 000U) = 0001 + 00UU = 0UUU

The addition algorithm shown above comes from a few logical ideas. Firstly, any unknown bit in either input will be unknown in the output (because it can change the result bit in the same location). Secondly, some carry bits may also become unknown, possibly a whole run of them. These unknown bits can be determined by finding difference between the maximum possible sum and the minimum possible sum.

The minimum value in a given set is a.ones and the maximum is a.ones | a.unknowns. So the initial algorithm was:

  struct known_bits kb_add(struct known_bits a,
                           struct known_bits b) {
    struct known_bits result;
    result.unknowns = a.unknowns | b.unknowns |
                      ((a.ones + b.ones) ^ ((a.ones | a.unknowns) +
                                            (b.ones | b.unknowns));
    result.ones = (a.ones + b.ones) & ~result.unknowns;
    return result;
  }

Alternately, the maximum can be represented as (a.ones + a.unknowns) because no bits are set in both a.ones and a.unknowns, so the addition will never carry. This representation allows the following transformation to the function above:

    (a.ones | a.unknowns) + (b.ones | b.unknowns)
  = (a.ones + a.unknowns) + (b.ones + b.unknowns)
  = (a.ones + b.ones) + a.unknowns + b.unknowns

The (a.ones + b.ones) expression now appears in three places, so we calculate it ahead of time to simplify things, giving the function shown at the top.

Questions

Can you prove or disprove the addition function? I can’t find a counter-example when exhaustively searching 8-bit values, but that’s not really a proof.

Can you compute the known-bits for multiplication? It’s easy to start getting useful results, but I would love to see a maximally precise less-than-exponential-time algorithm. Can it be proven that that’s impossible?

Can it be simplified further? John Regehr has had interesting results in program synthesis along these lines (which is what reminded me to post this – thanks!)

Update: if you want to play with implementing different operations, I’ve put some test code in a gist, based on a gist by @zwegner. It includes an implementation of subtraction.