CRC32 is a checksum first proposed in 1961, and now used in a wide variety of performance sensitive contexts, from file formats (zip, png, gzip) to filesystems (ext4, btrfs) and protocols (like ethernet and SATA). So, naturally, a lot of effort has gone into optimising it over the years. However, I discovered a simple update to a widely used technique that makes it possible to run twice as fast as existing solutions on the Apple M1.
Searching for the state-of-the-art, I found a lot of outdated posts, which is unsurprising for a sixty year old problem. Eventually I found a MySQL blog post from November 2021 that presents the following graph, including M1 figures, and gives us some idea that 30GB/s is considered fast:
In fact, in my own testing of the zlib crc32 function, I saw that it performs at around 30GB/s on the M1, so a little better than the graph, which is promising. Possibly that version has been optimised by Apple?
I wanted to try to implement my own version. So, I started at the obvious place, with a special ARM64 instruction designed for calculating CRC32 checksums: CRC32X. This can produce a checksum of 8-bytes, with a latency of 3 cycles. So, theoretically, using this instruction, we could get 3.2GHz / 3 * 8B = 8.5GB/s. On the other hand, CRC32X has a throughput of one per cycle, so supposing we can avoid being latency bound (e.g. by calculating bits of the CRC in chunks, and then combining them) we could get 3.2GHz / 1 * 8B = 25.6GB/s. That’s maybe a little better than the numbers in the MySQL chart, but this is a theoretical best case, not accounting for the overhead of combining the results. (And, as you may have read or guessed, this is the method used in the MySQL post.)
So, can we do better than CRC32X? The M1 can run eight instructions per cycle, and our best idea so far only runs at one instruction per cycle, so maybe we can. Besides, I already tested zlib, and it already performs at 30GB/s, so I know there’s a better way.
The better way was published by Intel in the 2009 paper Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction. This algorithm has been widely implemented, and has been ported to use equivalent ARM64 instructions, PMULL and PMULL2, e.g. in Chromium (coincidentally committed only hours ago at the time of this writing).
I won’t delve into the maths – I don’t even properly understand it – but the main loop has four independent latency chains that look something like this:
x5 = (uint64x2_t) pmull_lo(x1, x0); y7 = vld1q_u64((const uint64_t *)(buf)); x1 = (uint64x2_t) pmull_hi(x1, x0); x1 = veorq_u64(x1, x5); x1 = veorq_u64(x1, y5);
At a glance, I’d say the latency chain was PMULL2 (3) + EOR (2) + EOR (2) = 7 cycles. However, the M1 can fuse PMULL / PMULL2 instructions with subsequent EOR instructions, giving a single uop with three cycle latency. So we can reshuffle this to [PMULL + EOR] (3) + [PMULL2 + EOR] (3) = 6 cycles. (This is ideal for maximising throughput, as it minimises fused uop count, but if you want to minimise latency you can take the PMULL2 off the critical path, giving [PMULL + EOR] (3) + EOR (2) = 5 cycles.)
So we know each chain will have a latency of 6 cycles and have 2 (fused) uops. These uops have a throughput of 4 per cycle, so we can calculate how many independent chains we could benefit from. Over 6 cycles, we can run 6*4 = 24 uops. Since each chain needs 2 uops, we should benefit from having as many as 24/2=12 independent chains – three times more than the 2009 paper used, but also three times more than the more recent implementations I’ve seen.
To be wildly optimistic, if SIMD throughput were the bottleneck, this could run at 0.5 cycles per 16-byte register. 3.2GHz / 0.5 * 16B = 102GB/s. However, that SIMD throughput requires we sustain the maximum of eight unfused instructions per cycle, which would leave no time to load values from memory. Since we’ll need 1 load uop for every four unfused uops (for a total of five out of the eight possible unfused uops per cycle), a more realistic estimate of the frontend limit is 3.2GHz / (5/8) * 16B = 82GB/s.
(By contrast, if we only process 4*16B = 64B per iteration, matching the paper, and have a 6 cycle critical path, we could at most reach 3.2GHz / 6 * 64B = 34GB/s. I believe this is roughly what the zlib function was doing.)
Implementing this is mostly fairly straightforward – stepping in 192 byte increments and copying code to add more latency chains, but it does requires computing new values for k1 and k2, which I did by calling the private zlib function x2nmodp:
uint64_t k1 = (uint64_t)x2nmodp(12*128+32, 0) << 1; // 0x1821d8bc0 uint64_t k2 = (uint64_t)x2nmodp(12*128-32, 0) << 1; // 0x12e968ac4
The resulting code runs at about 70GB/s on the M1, reaching up to 75GB/s if the assembly is adjusted to always have valid fusion pairs. There’s probably still some room for improvement, but I’m pretty happy with that.
My test code is available, although it is not suitable for real-world use as-is.
Update (2022-06-06): This post glosses over the distinction between the two widely used variants of 32-bit CRC (known as CRC32 and CRC32C), because it’s inconsequential – they differ in only the polynomial, so you can use this technique for either with the same performance. Aarch64 also provides both CRC32CX and CRC32X instructions, so you can also use the hardware accelerated techniques for either (although not arbitrary polynomials), again with the same performance.
There are a couple of caveats I didn’t discuss. Firstly, as far as I know, the PMULL + EOR fusion that allows us to run at almost eight SIMD instructions per cycle is unique to Apple designed CPUs, so this is unlikely to be nearly as competitive on other CPUs. The other big risk with this method is that it uses more operations per 16-byte-chunk than the CRC32X instruction technique (3 for CRC32X, vs 5 for this technique). So for small chunks of data, where the CPU can do other work at the same time as the checksum, this is likely to hurt performance, and only on larger chunks does the speed difference really matter. My profiling is all done with 128KB of data or more.
I wrote a slightly cleaner/faster version using inline assembly and C++ templates to remove magic numbers. This reaches ~76GB/s. I also messed with methods for handling unaligned start/end chunks efficiently without reading out of bounds, but that ended up being a huge mess trying to work around problems in clang’s code generation and handle small sizes efficiently – the rough technique for larger buffers is here.