Converting integers to fixed-width strings faster with Neon SIMD on the Apple M1

I was inspired by Daniel Lemire’s blog post, Converting integers to fix-digit representations quickly (and the follow-up, Converting integers to decimal strings faster with AVX-512), to try solving the same problem using the Neon SIMD instructions on the Apple M1, with impressive performance results, which I attribute mostly to the hardware. (I previously measured and documented the performance of most of these instructions.)

The problem posed is to convert any integer from 0 to 9999999999999999 to a 16-digit string, with leading zeroes if required. On its own, this isn’t terribly useful, but it’s a precise problem to study (algorithms with variable length outputs are much more likely to have their performance vary with their input data), and it’s a useful building-block for more general integer-to-string implementations.
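As a point of reference, the whole problem fits in one line of standard C (a baseline sketch of my own, not from either post; the name `to_fixed16` is made up):

```c
#include <stdio.h>

// Baseline: format v (0 <= v <= 9999999999999999) as exactly 16 digits,
// with leading zeroes, into a 17-byte buffer (16 digits plus NUL).
static void to_fixed16(unsigned long long v, char out[17]) {
    snprintf(out, 17, "%016llu", v);
}
```

All of the methods compared here produce this same 16-digit output; the interesting part is producing it in a nanosecond or so rather than going through the printf machinery.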

Lemire’s first post also presents this table, with a conspicuous gap for SIMD on M1, which I have filled. For more details on the other methods, see the aforementioned post:

function         Apple M1, LLVM 12    AMD Zen 2, GCC 10
linear           14 ns                25 ns
backward linear  7.7 ns               18 ns
tree             6.9 ns               15 ns
Khuong           3.3 ns               8.0 ns
small table      3.7 ns               7.1 ns
SIMD             0.88 ns              4.8 ns
big table        1.5 ns               2.9 ns

I started out by writing a Neon implementation of Paul Khuong and John Wittrock’s excellent SWAR (SIMD-within-a-register) method, described in the blog post How to print integers really fast, and then I iterated on the method to gradually make it more Neon-friendly and work around compiler issues.

Neon implementation (text version on gist)

There are some fun tricks, but it’s mostly just the same old reciprocal division, adjusted to take advantage of the SQDMULH instruction. The existing SWAR implementation does something like the following a few times:
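“Reciprocal division” here means replacing a division by 100 with a multiply and a shift. A scalar sketch of the idea (the classic 5243 / 2^19 constant pair, not necessarily the exact constants used in the Neon code):

```c
#include <stdint.h>

// x / 100 as a multiply and a shift: 5243 / 2^19 is slightly more than
// 1/100, and the error is small enough that the result is exact for
// all x < 43699 -- plenty for the 4-digit chunks handled here.
static inline uint32_t div100(uint32_t x) {
    return (x * 5243) >> 19;
}
```

SQDMULH performs a similar multiply-and-keep-the-high-half in a single instruction, which is why this style of division slots in so neatly.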

  uint64_t bot = merged - 100ULL * top;   // bot = merged % 100
  uint64_t hundreds = (bot << 16) + top;  // pack: bot in the high half, top in the low half

By reversing the packing order (i.e. shifting top left instead of bot), I get:

  uint64_t hundreds = merged - 100ULL * top + (0x10000 * top);

Which folds to:

  uint64_t hundreds = merged + top * (-100 + 0x10000);

Which is a single multiply-add operation. Unfortunately, some digits are now backwards, but I use a final REV64 instruction to rectify it, which works out nicely. (This optimisation may also be applicable to the SWAR method, if a cheap byte-reverse operation is available, and you’re willing to use it.)

This is able to process a value in about 2.8 cycles on the M1, which works out to 0.88 ns per value on the M1 Max in High Power mode (or 1.1 ns in Low Power):

khuong      3.43233
neon        0.875787
backlinear  7.61937
linear     13.2977
tree        6.82174
treetst     6.92956
treest      3.65527
treebt      1.47125

I believe this is a good deal faster than the AVX-512 implementation (Lemire’s post gives 2.5 ns on a 3.30GHz Tiger Lake, i.e. roughly 8 cycles per value). How can 128-bit Neon beat 512-bit AVX? Tiger Lake is designed to run at speeds up to 5GHz, so it’s at a disadvantage running at 3.3GHz. However, even running at 5GHz would only get it to around 1.6 ns, which is still a huge difference.

On the other hand, the M1’s four-wide execution of 128-bit operations is equivalent to Tiger Lake’s one-wide execution of 512-bit operations (in terms of per-cycle throughput), but the M1 is more flexible, since it can be doing four different things each cycle. The Neon solution runs 10 SIMD instructions (for a theoretical maximum throughput of 2.5 cycles on the M1), whereas the AVX-512 solution runs four SIMD multiplies and a shuffle (which I think has a theoretical maximum throughput of 4 to 6 cycles on Tiger Lake?), so this flexibility seems to be the most likely explanation. Intel SIMD instructions are also more likely to contend with scalar instructions for ports, which may be another factor.

A few caveats: this isn’t a controlled experiment – I’m comparing different algorithms on different CPUs, although this is somewhat necessary when comparing very different SIMD ISAs. I couldn’t port the AVX-512 code to Neon, but I didn’t try to port the Neon code back to x86 SIMD, so it’s not impossible there are gains to be had there. The benchmark measures throughput, and ignores latency, which is generally the right thing to do, but still good to keep in mind. The benchmark also largely ignores the cost of loading constant values into registers, as this initialisation is hoisted out of the benchmark loop. Again, this is probably the right thing to do, but may significantly change relative performance if the function isn’t inlined into a loop, or if it is inlined into a loop with significant register pressure. I haven’t bothered to optimise my code for register usage or initialisation speed, but I believe there’d be some easy wins by using the “by element” forms of the SQDMULH and MLA instructions to pack multiple constants into one register.

Code is in this gist – this is a small addition to the benchmarking code from Lemire’s earlier post.
