Bit-Twiddling: Optimising AArch64 Logical Immediate Encoding (and Decoding)

I came up with a (seemingly) new method to encode bitmask immediate values on ARM64. This really isn’t worth optimising – clarity and verifiability are more important – but it’s a fun bit-twiddling problem, and the solution I came up with is a bit shorter and ~30% faster than the best existing algorithms I could find (at least in my rough tests – value dependence may complicate things).

Defining Logical Immediates

I found the description of logical immediate values in the specification very confusing, but with the help of Dominik Inführ’s post Encoding of Immediate Values on AArch64, and especially the linked list of all possible encodings, I was able to make sense of it, and come up with my own description that makes this algorithm a bit more evident.

I describe a logical immediate as a pattern of o ones preceded by z more-significant zeroes, repeated to fill a 64-bit integer. o > 0, z > 0, and the size (o + z) is a power of two in [2,64] (meaning the value has at least one zero-bit and one-bit, and repeats to fill the 64-bit integer without any truncation). This part of the pattern is encoded in the fields imms and N. Additionally, the field immr encodes a further right rotate of the repeated pattern. This allows a wide range of useful values to be represented, although notably not an all-zeroes or all-ones pattern.

(The specification describes this right-rotate as rotating the pattern of ones and zeroes before repetition. However, as it’s a repeating pattern, rotating the 64-bit value afterwards is equivalent, and a 64-bit rotate is usually available as a single instruction.)

Encoding

At a high level, my method involves finding the rotation, doing the inverse of the rotation, to normalise the bit pattern, then finding the number of ones and size, and using knowledge of the size to check the pattern is repeating.

I first reject all-zero and all-one bit patterns, then remove the immr rotation. Prior to the rotation, a decoded value will have a one in the least-significant bit, and a zero in the most-significant bit. So I find a one adjacent to a less significant zero, and rotate the value such that that one is the least-significant bit.

There are a few different code sequences that can accomplish this, but I settled on clearing the trailing ones with (x & (x+1)) and counting the trailing zeroes, then rotating the original value right by that amount. Note that this “trailing zeroes count” must be 0 or 64 for an all-zero value, as we may have cleared all the one bits.

This “normalized” value now has a one in the least significant bit, and a zero in the most significant bit, so we can count leading zeroes and trailing ones to determine z and o respectively. We’ve rejected all-zero and all-one values, so we can be sure that 1 <= z < 64 and 1 <= o < 64, and the size 2 <= o + z <= 64.

The final validation is to rotate the original value by the size o + z, and check that the result is still equal to the original value. This validates that we have a repeating pattern filling the 64-bit value (by effectively comparing each repetition of the pattern to the one next to it).

More confusingly, it also ensures that o + z is a power of two. The only minimal patterns that can repeat to fill a 64-bit value must have a length that is a factor of 64 (i.e. it is a power of two in the range [1,64]). By minimal patterns I mean patterns which do not themselves consist of a repeating pattern. For example, 010101 is a non-minimal pattern of a non-power-of-two length that can pass the rotation test. It consists of the minimal pattern 01. All our patterns are minimal, as they contain only one contiguous run of ones separated by at least one zero. (I’ll admit this is a little hand-wavy, but I did exhaustively verify it for 32-bit values.)

[Update: Pete Cawley offered the following explanation, which I’m much happier with:

(x rotate r) == x implies (x rotate gcd(r, bit width of x)) == x. In our case, r == o+z, and we know from construction of o and z that x rotated by less than o+z is not equal to x. Hence the gcd has to equal r, hence r divides bit width.

As someone with a hobbyist interest in maths, the correctness of the first statement wasn’t immediately apparent to me. Although there’s definitely some mathematical arguments I could try to make, and I’m sure others could offer me simple proofs all the way down to a level I’m comfortable with, I’m entirely satisfied that the equivalence of the two expressions is exhaustively verifiable in a relatively foolproof way by mapping out which bits are ensured to be equal to which other bits by (x rotate r) == x for each r from 0 to 63. [Update: Pete added As for why x == x rot r and x == x rot s implies x == x rot gcd(r, s): x == x rot r implies x == x rot (i * r) for any integer i (by induction). Extended gcd gives gcd(r, s) == i * r + j * s. Then x == (x rot (i * r)) rot (j * s) == x rot (i * r + j * s) == x rot gcd(r, s).]

The statement “x rotated by less than o+z is not equal to x” was intuitive to me, but for the sake of completeness: if we rotate right by one, the rotated ones (that we counted) will begin to overlap the zeroes (that we counted), ensuring (x rotate 1) != x. Rotating further, it’s not until the first one has been moved past the last zero (i.e. we have rotated by o + z places), that the counted zeroes and ones will not interfere with each other, and (x rotate r) == x may be true.]

Finally, we encode the values as per the specification, and we’re done.

So, here’s the code:

bool encodeLogicalImmediate64(uint64_t val, int *N, int *immr, int *imms)
{
    if (val == 0 || ~val == 0)
        return false;

    int rotation = countTrailingZeros64(val & (val + 1));
    uint64_t normalized = rotateRight64(val, rotation & 63);

    int zeroes = nonzeroCountLeadingZeros64(normalized);
    int ones = nonzeroCountTrailingZeros64(~normalized);
    int size = zeroes + ones;

    if (rotateRight64(val, size & 63) != val)
        return false;

    *immr = -rotation & (size - 1);
    *imms = (-(size << 1) | (ones - 1)) & 0x3f;
    *N = (size >> 6);

    return true;
}

bool encodeLogicalImmediate32(uint32_t val, int *N, int *immr, int *imms)
{
    uint64_t val64 = ((uint64_t)val << 32) | val;
    return encodeLogicalImmediate64(val64, N, immr, imms);
}

For comparison’s sake, see Linux’s implementation, Dolphin’s implementation, LLVM’s implementation, CoreCLR’s implementation and LuaJIT’s implementation (which performs best of those tested). Oh, and Chakra’s 100kb hash table implementation (which is similar to binutils’ implementation and HotSpot’s implementation, which build tables at runtime and encode with a binary search).

(There are a few variations worth considering. Depending on architecture, working from different ends of the bit-pattern may be cheaper. Early versions normalised the value by counting leading zeroes, shifting left by that amount, then counting leading ones, and then rotating the original value left by the sum of the amounts. Also, the current “count trailing ones” implementation is three instructions on ARM64 (RBIT + MVN + CLZ), but could be implemented with the RBIT + CLS (Count Leading Sign), as we know the number of trailing ones is greater than zero. [Update: I learned that that’s not how CLS works]. Let me know if you have suggestions for other tricks.)

Decoding

Again, for decoders, a lot of implementations seem to closely follow the spec, which is probably wise – you can look at the spec and see that it does the right thing, and it’s hopefully not performance critical. But it’s another fun bit-twiddling problem, so here’s my best effort:

#define DECODE_FAILURE 0 // not a valid logical immediate

uint64_t decodeLogicalImmediate64(int N, int immr, int imms) {
    static const uint64_t mask_lookup[] = {
        0xffffffffffffffff, // size = 64
        0x00000000ffffffff, // size = 32
        0x0000ffff0000ffff, // size = 16
        0x00ff00ff00ff00ff, // size = 8
        0x0f0f0f0f0f0f0f0f, // size = 4
        0x3333333333333333, // size = 2
    };

    int pattern = (N << 6) | (~imms & 0x3f);

    if ((pattern & (pattern - 1)) == 0)
        return DECODE_FAILURE;

    int leading_zeroes = nonzeroCountLeadingZeros32(pattern);
    int ones = (imms + 1) & (0x7fffffff >> leading_zeroes);
    uint64_t mask = mask_lookup[leading_zeroes - 25];
    return rotateRight64(mask ^ (mask << ones), immr);
}

uint32_t decodeLogicalImmediate32(int N, int immr, int imms) {
    if (N) return DECODE_FAILURE;
    return (uint32_t)decodeLogicalImmediate64(N, immr, imms);
}

I couldn’t come up with a way to repeat values a variable number of times without a loop or a lookup, so I went with a lookup. I was originally thinking of using a multiply, but then I thought of the mask ^ (mask << ones) idea, which creates ones 1-bits in all the right places.

I didn’t benchmark this one, though I imagine it’s a bit faster than common techniques involving loops. I’m pretty happy with this, but I still wonder if I’m missing a good trick.

[Update: I was missing a good trick. Zach Wegner notedfor the decoding lookup table equation, you could use 128b unsigned division with (2^128-1) / (2^2^(6-n) + 1), but that’s a tad ridiculous…” (edited to remove an accidental +1). This is fantastic black magic, and I clearly need to learn more about division tricks for bit-twiddling. I’m pretty sure this remains impractical, but with a throughput of one 64-bit division every two cycles on the Apple M1 Firestorm, these kinds of tricks are more practical now than ever before. (If porting this trick to AArch64, without a 128-bit divider, I’d try 64-bit unsigned division, with a CNEG to handle the n=0 case.)

Update 2: To my surprise, it’s almost practical. The division is only ~2% slower than the table on Firestorm. On Coffee Lake, it’s ~3x slower than the table, but it’s still ~2x faster than LLVM’s implementation. Code is in this gist – I was way off with my initial idea about using CNEG, it ended up a bit simpler than that.]

Final Notes

The code in this gist has been modified a little from the tested versions for presentation. The full version is available in a gist.

I hereby place the encoder and decoder code in the public domain, as per the terms of the CC0 license, although if you do use it, I’d love to hear about it. My Twitter is @dougallj.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s