Bit-Twiddling: Optimising AArch64 Logical Immediate Encoding (and Decoding)

I came up with a (seemingly) new method to encode bitmask immediate values on ARM64. This really isn’t worth optimising – clarity and verifiability are more important – but it’s a fun bit-twiddling problem, and the solution I came up with is a bit shorter and ~30% faster than the best existing algorithms I could find (at least in my rough tests – value dependence may complicate things).

Defining Logical Immediates

I found the description of logical immediate values in the specification very confusing, but with the help of Dominik Inführ’s post Encoding of Immediate Values on AArch64, and especially the linked list of all possible encodings, I was able to make sense of it, and come up with my own description that makes this algorithm a bit more evident.

I describe a logical immediate as a pattern of o ones preceded by z more-significant zeroes, repeated to fill a 64-bit integer. o > 0, z > 0, and the size (o + z) is a power of two in [2,64] (meaning the value has at least one zero-bit and one-bit, and repeats to fill the 64-bit integer without any truncation). This part of the pattern is encoded in the fields imms and N. Additionally, the field immr encodes a further right rotate of the repeated pattern. This allows a wide range of useful values to be represented, although notably not an all-zeroes or all-ones pattern.

(The specification describes this right-rotate as rotating the pattern of ones and zeroes before repetition. However, as it’s a repeating pattern, rotating the 64-bit value afterwards is equivalent, and a 64-bit rotate is usually available as a single instruction.)

Encoding

At a high level, my method involves finding the rotation, doing the inverse of the rotation, to normalise the bit pattern, then finding the number of ones and size, and using knowledge of the size to check the pattern is repeating.

I first reject all-zero and all-one bit patterns, then remove the immr rotation. Prior to the rotation, a decoded value will have a one in the least-significant bit, and a zero in the most-significant bit. So I find a one adjacent to a less significant zero, and rotate the value such that that one is the least-significant bit.

There are a few different code sequences that can accomplish this, but I settled on clearing the trailing ones with (x & (x+1)) and counting the trailing zeroes, then rotating the original value right by that amount. Note that this “trailing zeroes count” must be 0 or 64 for an all-zero value, as we may have cleared all the one bits.

This “normalized” value now has a one in the least significant bit, and a zero in the most significant bit, so we can count leading zeroes and trailing ones to determine z and o respectively. We’ve rejected all-zero and all-one values, so we can be sure that 1 <= z < 64 and 1 <= o < 64, and the size 2 <= o + z <= 64.

The final validation is to rotate the original value by the size o + z, and check that the result is still equal to the original value. This validates that we have a repeating pattern filling the 64-bit value (by effectively comparing each repetition of the pattern to the one next to it).

More confusingly, it also ensures that o + z is a power of two. The only minimal patterns that can repeat to fill a 64-bit value must have a length that is a factor of 64 (i.e. it is a power of two in the range [1,64]). By minimal patterns I mean patterns which do not themselves consist of a repeating pattern. For example, 010101 is a non-minimal pattern of a non-power-of-two length that can pass the rotation test. It consists of the minimal pattern 01. All our patterns are minimal, as they contain only one contiguous run of ones separated by at least one zero.

Finally, we encode the values as per the specification, and we’re done.

So, here’s the code:

bool encodeLogicalImmediate64(uint64_t val, int *N, int *immr, int *imms)
{
    if (val == 0 || ~val == 0)
        return false;

    int rotation = countTrailingZeros64(val & (val + 1));
    uint64_t normalized = rotateRight64(val, rotation & 63);

    int zeroes = nonzeroCountLeadingZeros64(normalized);
    int ones = nonzeroCountTrailingZeros64(~normalized);
    int size = zeroes + ones;

    if (rotateRight64(val, size & 63) != val)
        return false;

    *immr = -rotation & (size - 1);
    *imms = (-(size << 1) | (ones - 1)) & 0x3f;
    *N = (size >> 6);

    return true;
}

bool encodeLogicalImmediate32(uint32_t val, int *N, int *immr, int *imms)
{
    uint64_t val64 = ((uint64_t)val << 32) | val;
    return encodeLogicalImmediate64(val64, N, immr, imms);
}

For comparison’s sake, see Linux’s implementation, Dolphin’s implementation, LLVM’s implementation, CoreCLR’s implementation and LuaJIT’s implementation (which performs best of those tested). Oh, and Chakra’s 100kb hash table implementation (which is similar to binutils’ implementation and HotSpot’s implementation, which build tables at runtime and encode with a binary search).

Decoding

Again, for decoders, a lot of implementations seem to closely follow the spec, which is probably wise – you can look at the spec and see that it does the right thing, and it’s hopefully not performance critical. But it’s another fun bit-twiddling problem, so here’s my best effort:

#define DECODE_FAILURE 0 // not a valid logical immediate

uint64_t decodeLogicalImmediate64(int N, int immr, int imms) {
    static const uint64_t mask_lookup[] = {
        0xffffffffffffffff, // size = 64
        0x00000000ffffffff, // size = 32
        0x0000ffff0000ffff, // size = 16
        0x00ff00ff00ff00ff, // size = 8
        0x0f0f0f0f0f0f0f0f, // size = 4
        0x3333333333333333, // size = 2
    };

    int pattern = (N << 6) | (~imms & 0x3f);

    if ((pattern & (pattern - 1)) == 0)
        return DECODE_FAILURE;

    int leading_zeroes = nonzeroCountLeadingZeros32(pattern);
    int ones = (imms + 1) & (0x7fffffff >> leading_zeroes);
    uint64_t mask = mask_lookup[leading_zeroes - 25];
    return rotateRight64(mask ^ (mask << ones), immr);
}

uint32_t decodeLogicalImmediate32(int N, int immr, int imms) {
    if (N) return DECODE_FAILURE;
    return (uint32_t)decodeLogicalImmediate64(N, immr, imms);
}

I couldn’t come up with a way to repeat values a variable number of times without a loop or a lookup, so I went with a lookup. I was originally thinking of using a multiply, but then I thought of the mask ^ (mask << ones) idea, which creates ones 1-bits in all the right places.

I didn’t benchmark this one, though I imagine it’s a bit faster than common techniques involving loops. I’m pretty happy with this, but I still wonder if I’m missing a good trick.

[Update: I was missing a good trick. Zach Wegner noted “for the decoding lookup table equation, you could use 128b unsigned division with (2^128-1) / (2^2^(6-n) + 1), but that’s a tad ridiculous…“. This is fantastic black magic, and I clearly need to learn more about division tricks for bit-twiddling. To my surprise, it’s almost practical. The division is only ~2% slower than the table on Firestorm. On Coffee Lake, it’s ~3x slower than the table, but it’s still ~2x faster than LLVM’s implementation. Code is in this gist – I was way off with my initial idea about using CNEG, it ended up a bit simpler than that.]

Final Notes

The code in this gist has been modified a little from the tested versions for presentation. The full version is available in a gist.

I hereby place the encoder and decoder code in the public domain, as per the terms of the CC0 license, although if you do use it, I’d love to hear about it. My Twitter is @dougallj.

[Update: Pete Cawley offered the following proof that the rotation check ensures that o + z is a power of two:

(x rotate r) == x implies (x rotate gcd(r, bit width of x)) == x. In our case, r == o+z, and we know from construction of o and z that x rotated by less than o+z is not equal to x. Hence the gcd has to equal r, hence r divides bit width.

As someone with a hobbyist interest in maths, the correctness of the first statement wasn’t immediately apparent to me. Although there’s definitely some mathematical arguments I could try to make, and I’m sure others could offer me simple proofs all the way down to a level I’m comfortable with, I’m entirely satisfied that the equivalence of the two expressions is exhaustively verifiable in a relatively foolproof way by mapping out which bits are ensured to be equal to which other bits by (x rotate r) == x for each r from 0 to 63. [Update: Pete added As for why x == x rot r and x == x rot s implies x == x rot gcd(r, s): x == x rot r implies x == x rot (i * r) for any integer i (by induction). Extended gcd gives gcd(r, s) == i * r + j * s. Then x == (x rot (i * r)) rot (j * s) == x rot (i * r + j * s) == x rot gcd(r, s).]

The statement “x rotated by less than o+z is not equal to x” was intuitive to me, but for the sake of completeness: if we rotate right by one, the rotated ones (that we counted) will begin to overlap the zeroes (that we counted), ensuring (x rotate 1) != x. Rotating further, it’s not until the first one has been moved past the last zero (i.e. we have rotated by o + z places), that the counted zeroes and ones will not interfere with each other, and (x rotate r) == x may be true.]

Defining Logical Immediates

Encoding

Decoding

Final Notes

Share this:

Related

Leave a comment Cancel reply