I was inspired by Daniel Lemire’s blog post, Converting integers to fix-digit representations quickly (and the follow up Converting integers to decimal strings faster with AVX-512) to try solving the same problem using the Neon SIMD instructions on the Apple M1, with impressive performance results, which I attribute mostly to the hardware. (I previously measured and documented the performance of most of these instructions.)
The problem posed is to convert any integer from 0 to 9999999999999999 to a 16-digit string, with leading zeroes if required. On its own, this isn’t terribly useful, but it’s a precise problem to study (algorithms with variable length outputs are much more likely to have their performance vary with their input data), and it’s a useful building-block for more general integer-to-string implementations.
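For concreteness, a minimal scalar baseline for this exact task (my own sketch, not code from either post) can lean on `printf`'s zero-padding:

```c
#include <inttypes.h>
#include <stdio.h>

// Hypothetical scalar baseline: write v (0..9999999999999999) as exactly
// 16 decimal digits, zero-padded, into out (17 bytes including the NUL).
static void to_fixed16(uint64_t v, char out[17]) {
    snprintf(out, 17, "%016" PRIu64, v);
}
```

Every faster method in the table below is doing the same job as this one-liner, just without the per-digit division loop hidden inside `snprintf`.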
Lemire’s first post also presents this table, with a conspicuous gap for SIMD on M1, which I have filled. For more details on the other methods, see the aforementioned post:
|function|Apple M1, LLVM 12|AMD Zen 2, GCC 10|
|---|---|---|
|linear|14 ns|25 ns|
|backward linear|7.7 ns|18 ns|
|tree|6.9 ns|15 ns|
|Khuong|3.3 ns|8.0 ns|
|small table|3.7 ns|7.1 ns|
|SIMD|0.88 ns|4.8 ns|
|big table|1.5 ns|2.9 ns|
I started out by writing a Neon implementation of Paul Khuong and John Wittrock’s excellent SWAR (SIMD-within-a-register) method, described in the blog post How to print integers really fast, and then I iterated on the method to gradually make it more Neon-friendly and work around compiler issues.
There are some fun tricks, but it’s mostly just the same old reciprocal division, adjusted to take advantage of the SQDMULH instruction. The existing SWAR implementation does something like the following a few times:
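The reciprocal-division trick replaces a divide with a multiply and a shift. As a plain-C illustration (my own constants, not necessarily the exact ones used in the Neon code):

```c
#include <stdint.h>

// Divide by 100 without a division instruction: 5243 is ceil(2^19 / 100),
// so (n * 5243) >> 19 == n / 100 for all 0 <= n <= 43690, which easily
// covers 4-digit chunks. The multiply-high half of this pattern is what
// SQDMULH provides in a single Neon instruction.
static uint32_t div100(uint32_t n) {
    return (n * 5243u) >> 19;
}
```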
```c
uint64_t bot = merged - 100ULL * top;
uint64_t hundreds = (bot << 16) + top;
```
By reversing the packing order (i.e. shifting top left instead of bot), I get:
```c
uint64_t hundreds = merged - 100ULL * top + (0x10000 * top);
```
Which folds to:
```c
uint64_t hundreds = merged + top * (-100 + 0x10000);
```
Which is a single multiply-add operation. Unfortunately, some digits are now backwards, but I use a final REV64 instruction to rectify it, which works out nicely. (This optimisation may also be applicable to the SWAR method, if a cheap byte-reverse operation is available, and you’re willing to use it.)
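The fold is easy to sanity-check in scalar code (a hand-written test, not from the post): both packings hold the same digit pairs, just in opposite halves of each lane, which is what the final byte-reverse corrects:

```c
#include <stdint.h>

// Two-step packing: split merged (a 4-digit value) into its top and bottom
// digit pairs, shifting bot into the high half.
static uint64_t pack_two_step(uint64_t merged, uint64_t top) {
    uint64_t bot = merged - 100ULL * top;
    return (bot << 16) + top;
}

// Folded packing: shift top instead of bot, so the whole thing collapses
// into one multiply-add -- at the cost of the halves coming out reversed.
static uint64_t pack_folded(uint64_t merged, uint64_t top) {
    return merged + top * (-100 + 0x10000);
}
```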
This processes a value in about 2.8 cycles on the M1, which works out to 0.88 ns per value on the M1 Max in High Power mode (or 1.1 ns in Low Power):
```
khuong      3.43233
neon        0.875787
backlinear  7.61937
linear      13.2977
tree        6.82174
treetst     6.92956
treest      3.65527
treebt      1.47125
```
I believe this is a good deal faster than the AVX-512 implementation (Lemire’s post gives 2.5 ns on a 3.30GHz Tiger Lake). How can 128-bit Neon beat 512-bit AVX? Tiger Lake is designed to run at speeds up to 5GHz, so it’s at a disadvantage running at 3.3GHz. However, even running at 5GHz would only bring it to around 1.6 ns, which is still a huge difference.
On the other hand, the M1’s four-wide execution of 128-bit operations is equivalent to Tiger Lake’s one-wide execution of 512-bit operations (in terms of per-cycle throughput), but the M1 is more flexible, since it can be doing four different things each cycle. The Neon solution runs 10 SIMD instructions (for a theoretical maximum throughput of 2.5 cycles on the M1), whereas the AVX-512 solution runs four SIMD multiplies and a shuffle (which I think has a theoretical maximum throughput of 4 to 6 cycles on Tiger Lake?), so this flexibility seems to be the most likely explanation. Intel SIMD instructions are also more likely to contend with scalar instructions for ports, which may be another factor.
A few caveats:

- This isn’t a controlled experiment – I’m comparing different algorithms on different CPUs, although this is somewhat necessary when comparing very different SIMD ISAs. I couldn’t port the AVX-512 code to Neon, but I didn’t try to port the Neon code back to x86 SIMD, so it’s not impossible there are gains to be had there.
- The benchmark measures throughput and ignores latency, which is generally the right thing to do, but still good to keep in mind.
- The benchmark also largely ignores the cost of loading constant values into registers, as this initialisation is hoisted out of the benchmark loop. Again, this is probably the right thing to do, but it may significantly change relative performance if the function isn’t inlined into a loop, or if it is inlined into a loop with significant register pressure. I haven’t bothered to optimise my code for register usage or initialisation speed, but I believe there’d be some easy wins from using the “by element” forms of the SQDMULH and MLA instructions to pack multiple constants into one register.