Parallelising Huffman decoding and x86 disassembly by synchronising non-self-synchronising prefix codes

Variable-length, non-self-synchronising prefix codes (like x86 instructions and Huffman codes) are hard to decode in parallel, as each word must be decoded to determine its length before the next one can be located and decoded. I came up with a practical method for finding a synchronisation point in the middle of a run of such prefix codes (provided the code has a maximum length), allowing a stream to be split into parts that can be processed in parallel.
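
To make the dependency concrete, linear decoding looks something like the following sketch (decode_all is a made-up name, and decode_length stands in for whatever decodes a single word and returns its length, as in the code later in the post):

  // the next offset depends on the decoded length, so each word must be
  // fully decoded before we even know where the next one starts
  void decode_all(uint8_t *data, size_t data_size) {
    size_t offset = 0;
    while (offset < data_size) {
      size_t size = decode_length(data, offset);
      offset += size;
    }
  }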

The fact that x86 tends to eventually synchronise is a fairly common observation (e.g. when disassembling at incorrect addresses in a debugger). For example, consider the following instructions:

488B4808          mov rcx,[rax+0x8]
488B4918          mov rcx,[rcx+0x18]
486310            movsxd rdx,dword [rax]
488D3492          lea rsi,[rdx+rdx*4]
4883C018          add rax,byte +0x18
837CF12000        cmp dword [rcx+rsi*8+0x20],byte +0x0

If we skip two bytes, we first see some incorrect results, but we quickly get back to real instructions:

4808488B          o64 or [rax-0x75],cl
49184863          o64 sbb [r8+0x63],cl
10488D            adc [rax-0x73],cl
3492              xor al,0x92
4883C018          add rax,byte +0x18
837CF12000        cmp dword [rcx+rsi*8+0x20],byte +0x0

I like to think of decoding at an invalid offset as giving you an arbitrary result with an arbitrary length. Each time, there's a chance you'll land on a valid offset; if you don't, you get another arbitrary result and another chance.

In real-world code this does work, and this is the premise of the method. However, in general, this is not guaranteed. For example, the x86 instruction add [rax], al is encoded as 00 00, so if we had a stream of zeroes, and we started in the wrong place, we’d never get back to the right instructions.

Method

I assume that the prefix code has a maximum length, n, which is 15 for x86 instructions. The idea is to start decoding at n consecutive offsets. One of these decoders is guaranteed to be decoding at a valid offset (invalid offsets are those that fall within real words, so there can be at most n-1 contiguous invalid offsets). Each of these decoders will visit a different sequence of offsets in the message, but hopefully they'll all converge with the valid decoder. The synchronisation point is the smallest offset that every decoder reaches.
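
As a minimal sketch of that idea (find_sync_after_naive is a made-up name; decode_length and MAX_CODE_LENGTH are the same assumed helpers as in the code below):

  size_t find_sync_after_naive(uint8_t *data, size_t offset) {
    // start one decoder at each of MAX_CODE_LENGTH consecutive offsets
    size_t decoders[MAX_CODE_LENGTH];
    for (size_t i = 0; i < MAX_CODE_LENGTH; i++)
      decoders[i] = offset + i;
    for (;;) {
      // find the decoder with the lowest offset
      size_t lowest = 0;
      for (size_t i = 1; i < MAX_CODE_LENGTH; i++)
        if (decoders[i] < decoders[lowest])
          lowest = i;
      // if every decoder has reached the same offset, we've synchronised
      int converged = 1;
      for (size_t i = 0; i < MAX_CODE_LENGTH; i++)
        converged &= decoders[i] == decoders[lowest];
      if (converged)
        return decoders[lowest];
      // otherwise, advance the lowest decoder by one word
      decoders[lowest] += decode_length(data, decoders[lowest]);
    }
  }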

With a few tricks, the implementation ends up looking quite different. To avoid reading more of the message than necessary, we repeatedly find the decoder with the lowest offset and advance it. To avoid redundant decoding, if it lands on the same offset as another decoder, we can discard it, and we can just repeat this process until there’s only one decoder remaining. These decoders will always be within n positions of each other, so each can be represented by a single set bit in an n-bit buffer.

This code is simplified – in practice it would need bounds-checking, and an iteration limit to bail out on hard- or impossible-to-synchronise streams without hurting performance.

  size_t find_sync_after(uint8_t *data, size_t offset) {
    assert(2 <= MAX_CODE_LENGTH && MAX_CODE_LENGTH <= 63);
    // bit i of probes marks a decoder at (offset + i); bit 0 is always set
    uint64_t probes = (1ull << MAX_CODE_LENGTH) - 1;
    while (probes != 1) {
      // decode one word at the lowest remaining decoder's offset
      uint64_t size = decode_length(data, offset);
      assert(1 <= size && size <= MAX_CODE_LENGTH);

      // drop that decoder's bit and shift the window forward one position
      offset += 1;
      probes >>= 1;

      // re-add the decoder at the offset it landed on; if another decoder
      // is already there, the two merge into a single bit
      probes |= 1ull << (size - 1);

      // skip forward so the lowest remaining decoder is at bit 0 again
      uint64_t skip = countTrailingZeros(probes);
      offset += skip;
      probes >>= skip;
    }
    return offset;
  }
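
A note on countTrailingZeros: it isn't standard C, but on GCC and Clang it could be a thin wrapper over the __builtin_ctzll intrinsic (undefined for zero, which is fine here, as probes always has at least one bit set when it's called):

  static inline uint64_t countTrailingZeros(uint64_t x) {
    // __builtin_ctzll(0) is undefined, but probes is never zero above
    return (uint64_t)__builtin_ctzll(x);
  }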

(I also came up with another method, to find the first synchronisation point at or after a given offset p: decode backwards through the stream, maintaining a ring-buffer of n target locations (the positions at or after p where a decode starting from each of the n contiguous locations would land), and continue until all entries in the ring-buffer converge to the same value, which is the result. This is more complicated and requires more decoding, but I thought it deserved a mention as it's interesting to think about, and it could be a drop-in replacement for an operation based on self-synchronising codes. [Update: added code for this to the end of the post])

Applications

This can be used to split the executable section of a large x86 binary into parts for multiple threads to work on, guaranteeing that each will see the same instructions as a single thread performing linear disassembly.
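
For example, a splitter might look something like the following sketch (split_for_threads and its signature are made up, and this ignores the bounds-checking caveats mentioned above):

  // give each of num_threads workers a chunk that starts at a
  // synchronisation point, so every boundary is an instruction start that
  // linear disassembly would also visit
  void split_for_threads(uint8_t *data, size_t size, size_t num_threads,
                         size_t *starts /* num_threads entries */) {
    starts[0] = 0;
    for (size_t i = 1; i < num_threads; i++)
      starts[i] = find_sync_after(data, i * size / num_threads);
    // thread i then disassembles [starts[i], starts[i+1]),
    // taking starts[num_threads] to be size
  }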

Huffman coding is generally latency bound, and decoding multiple streams at once makes it possible to take advantage of instruction-level parallelism (see Fabian Giesen's Reading bits in far too many ways series), but typically this is only practical in formats designed to have multiple independent streams. From Fabian's work, it's clear that this method could be applied to extract parallelism from large blocks of Huffman-coded data, especially where the size is known ahead of time. I have experimented with using this technique for Huffman-based DEFLATE (zlib/gzip) decoding, which unfortunately does not satisfy these criteria.

A DEFLATE stream typically consists of multiple different blocks, each using a different Huffman encoding, where the block size isn’t specified ahead of time. However, if we can guess an offset around half-way through the block, and find a synchronisation point, we can exploit instruction-level parallelism to decode both halves of the block at the same time.

This introduces a ton of complexity: the second half is decoded into temporary memory, and only after the first half completes can it be copied into place and its LZ77 references resolved. To make things worse, either the first or the second half may finish first, and the second half may have overshot the end of the current block and need to be discarded, requiring careful error handling. You'd also need some heuristic to guess how big the blocks are, perhaps assuming block sizes tend to be consistent (though currently I just jump to a hardcoded offset).
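
The overall shape might look something like the following sketch (everything here is hypothetical – struct decoder, decoder_step, decoder_done, and the scratch-buffer handling are stand-ins, not a real zlib API):

  // step two decoders in lockstep: their dependency chains are independent,
  // so an out-of-order CPU can overlap them
  void decode_block_halves(struct decoder *first, struct decoder *second) {
    // `first` starts at the block start and writes to the real output;
    // `second` starts at the synchronised midpoint and writes literals and
    // unresolved LZ77 references to scratch memory
    while (!decoder_done(first) && !decoder_done(second)) {
      decoder_step(first);
      decoder_step(second);
    }
    while (!decoder_done(first)) decoder_step(first);
    while (!decoder_done(second)) decoder_step(second);
    // once `first` completes we know where the block really ends: copy
    // `second`'s output into place and resolve its references, or discard
    // it if the midpoint guess overshot the end of the block
  }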

My experiments are preliminary and extremely dubious, using a single file and hand-tuned constants, but showed a roughly 25% speedup on Apple M1. I'd like to say that this shows that latency barriers in decoding existing file formats can be broken in the future, but this result is reliant on the nature of the input data, and much more thorough measurement is needed to see if such a method can improve performance in general, let alone be beneficial enough to justify the complexity.

Update (2022-08-01): Backwards Disassembly

Duncan Ogilvie pointed out that this is similar to what’s used for backwards disassembly in debuggers, for which my method to find the first synchronisation point starting from a given offset might be better suited. The caveat with these methods on x86 disassembly is that we assume that linear disassembly gives the desired result – this isn’t necessarily the case if there’s data mixed in with code, or overlapping instructions. So the real benefit of this method over others is consistency; we might not get a good answer, and we can’t guarantee we’ll get any answer without scanning to the start of the code segment, but we’ll always get the same answer.

In particular, using this method can lead to cases where the current instruction-pointer sits inside a different instruction in the linear disassembly, and you have to find a way to display that to the user gracefully – a situation other heuristics may avoid.

Either way, here’s code for the slower method to find the first synchronisation point starting from a given offset by scanning backwards:

  // this can be rounded up to a power of two so (x % RING_SIZE) is cheaper
  #define RING_SIZE MAX_CODE_LENGTH

  size_t find_first_sync_at(uint8_t *data, size_t offset, size_t data_size) {
    // if we're at the start, we're synchronised
    if (offset == 0)
      return 0;

    assert(RING_SIZE >= MAX_CODE_LENGTH);
    // targets[i % RING_SIZE] holds, for position (offset - 1 - i), the first
    // position (relative to offset) that a decode starting there reaches
    uint8_t targets[RING_SIZE];

    size_t p = offset - 1;
    int match_length = 0;
    size_t last_target = MAX_CODE_LENGTH; // invalid value
    for (size_t i = 0; ; i++) {
      // p == offset - 1 - i: walk backwards one position per iteration
      size_t l = decode_length(data, p, data_size);
      assert(1 <= l && l <= MAX_CODE_LENGTH);

      // a decode starting at p either lands at or past offset directly, or
      // lands on an already-visited position, whose target is in the ring
      size_t target = (l > i) ? p + l - offset : targets[(i - l) % RING_SIZE];
      assert(target < MAX_CODE_LENGTH); // unsigned, so only check the upper bound

      targets[i % RING_SIZE] = target;

      // count how many consecutive positions share the same target
      match_length = (last_target == target) ? match_length + 1 : 0;
      last_target = target;

      // if MAX_CODE_LENGTH consecutive positions agree, any valid decode
      // must pass through one of them and land on the same target; reaching
      // p == 0 means target is where the linear decode itself lands
      if (match_length == MAX_CODE_LENGTH - 1 || p == 0)
        return offset + target;

      p--;
    }
    }
  }
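
As a hypothetical usage example: since the returned offset is the first position at or after the given offset that linear disassembly lands on, a debugger could use it to test whether the instruction pointer is an instruction start in the linear disassembly (ip_is_linear_instruction_start is a made-up helper):

  // ip_offset is the instruction pointer relative to the code section start
  int ip_is_linear_instruction_start(uint8_t *code, size_t code_size,
                                     size_t ip_offset) {
    return find_first_sync_at(code, ip_offset, code_size) == ip_offset;
  }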
