CRC32 is a checksum first proposed in 1961, and now used in a wide variety of performance-sensitive contexts, from file formats (zip, png, gzip) to filesystems (ext4, btrfs) and protocols (like ethernet and SATA). So, naturally, a lot of effort has gone into optimising it over the years. However, I discovered a simple update to a widely used technique that makes it possible to run twice as fast as existing solutions on the Apple M1.

Searching for the state-of-the-art, I found a lot of outdated posts, which is unsurprising for a sixty-year-old problem. Eventually I found a MySQL blog post from November 2021 that presents the following graph, including M1 figures, and gives us some idea that 30GB/s is considered fast:

In fact, in my own testing of the zlib **crc32** function, I saw that it performs at around 30GB/s on the M1, so a little better than the graph, which is promising. Possibly that version has been optimised by Apple?

I wanted to try to implement my own version. So, I started at the obvious place, with a special ARM64 instruction designed for calculating CRC32 checksums: **CRC32X**. This can process 8 bytes of input per instruction, with a latency of 3 cycles. So, theoretically, using this instruction, we could get 3.2GHz / 3 * 8B = 8.5GB/s. On the other hand, **CRC32X** has a throughput of one per cycle, so supposing we can avoid being latency bound (e.g. by calculating bits of the CRC in chunks, and then combining them) we could get 3.2GHz / 1 * 8B = 25.6GB/s. That’s *maybe* a little better than the numbers in the MySQL chart, but this is a theoretical best case, not accounting for the overhead of combining the results. (And, as you may have read or guessed, this is the method used in the MySQL post.)

So, can we do better than **CRC32X**? The M1 can run eight instructions per cycle, and our best idea so far only runs at one instruction per cycle, so maybe we can. Besides, I already tested zlib, and it already performs at 30GB/s, so I *know* there’s a better way.

The better way was published by Intel in the 2009 paper *Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction*. This algorithm has been widely implemented, and has been ported to use equivalent ARM64 instructions, **PMULL** and **PMULL2**, e.g. in Chromium (coincidentally committed only hours ago at the time of this writing).

I won’t delve into the maths – I don’t even properly understand it – but the main loop has four independent latency chains that look something like this:

```
```
x5 = (uint64x2_t) pmull_lo(x1, x0);
y5 = vld1q_u64((const uint64_t *)(buf));
x1 = (uint64x2_t) pmull_hi(x1, x0);
x1 = veorq_u64(x1, x5);
x1 = veorq_u64(x1, y5);
```

At a glance, I’d say the latency chain was **PMULL2** (3) + **EOR** (2) + **EOR** (2) = 7 cycles. However, the M1 can fuse **PMULL** / **PMULL2** instructions with subsequent **EOR** instructions, giving a single uop with three-cycle latency. So we can reshuffle this to [**PMULL** + **EOR**] (3) + [**PMULL2** + **EOR**] (3) = 6 cycles. (This is ideal for maximising throughput, as it minimises the fused uop count, but if you want to minimise latency you can take the **PMULL2** off the critical path, giving [**PMULL** + **EOR**] (3) + **EOR** (2) = 5 cycles.)

So we know each chain will have a latency of 6 cycles and use 2 (fused) uops. These uops have a throughput of 4 per cycle, so we can calculate how many independent chains we could benefit from. Over 6 cycles, we can run 6 * 4 = 24 uops. Since each chain needs 2 uops, we should benefit from having as many as 24 / 2 = 12 independent chains – three times more than the 2009 paper used, and three times more than the more recent implementations I’ve seen.

To be wildly optimistic, if SIMD throughput were the bottleneck, this could run at 0.5 cycles per 16-byte register. 3.2GHz / 0.5 * 16B = 102GB/s. However, that SIMD throughput would require sustaining the maximum of eight unfused instructions per cycle, which would leave no time to load values from memory. Since we’ll need 1 load uop for every four unfused uops (for a total of five out of the eight possible unfused uops per cycle), a more realistic estimate of the frontend limit is 3.2GHz / (5/8) * 16B = 82GB/s.

(By contrast, if we only process 4*16B = 64B per iteration, matching the paper, and have a 6 cycle critical path, we could at most reach 3.2GHz / 6 * 64B = 34GB/s. I believe this is roughly what the zlib function was doing.)

Implementing this is mostly fairly straightforward – stepping in 192-byte increments and copying code to add more latency chains – but it does require computing new values for **k1** and **k2**, which I did by calling the private zlib function **x2nmodp**:

```
uint64_t k1 = (uint64_t)x2nmodp(12*128+32, 0) << 1; // 0x1821d8bc0
uint64_t k2 = (uint64_t)x2nmodp(12*128-32, 0) << 1; // 0x12e968ac4
```

The resulting code runs at about 70GB/s on the M1, reaching up to 75GB/s if the assembly is adjusted to always have valid fusion pairs. There’s probably still some room for improvement, but I’m pretty happy with that.

My test code is available, although it is not suitable for real-world use as-is.