Another approach to portable Javascript Spectre exploitation

Many people, myself included, have held the belief that Spectre exploits need to know, understand, and manipulate microarchitectural details that are specific to a given processor design. Published Spectre PoCs generally use techniques such as cache analysis, and flushing lines from the cache. Although Stephen Röttger has shown such techniques can be portable between architectures and practical from Javascript using only coarse timing side-channels, such precise techniques are not necessary. This post describes the techniques I used in browser-based Spectre proof-of-concepts that should work on any sufficiently out-of-order processor, and that can be amplified to use arbitrarily coarse timers.

A Safari arbitrary-read Spectre proof-of-concept, using amplification, that works as-is on a 2017 Kaby Lake, a 2019 Coffee Lake, an Apple A9X Twister, Apple A12 Vortex, and an Apple M1 Firestorm

In trying to reproduce the results of the two-year-old V8 paper Spectre is here to stay, I used a very coarse model of caches and speculative execution. The theoretical model is as follows:

  • CPUs have a cache, which must store any recently accessed memory, and is much smaller than main memory. Reading from the cache is faster than reading from main memory.
  • CPUs have memory-level parallelism, i.e. two cache misses can be processed at the same time, if the address of both is known. (On the other hand, if the address of one cache-missing load depends on the result of another cache-missing load, the two will be processed one after the other.)
  • If a memory read misses the cache, execution after the read will continue speculatively, with the branch predictor deciding which way branches that depend on the cache-missing value should go. Memory accesses during speculative execution may access and change which lines are in the cache (enabling the previously mentioned memory-level parallelism, as well as the cache side-channel to get information from a speculative context.)
  • A sufficiently smart branch predictor may predict non-random patterns of branches, and a sufficiently smart pre-fetcher may pre-fetch non-random memory access patterns, but (pseudo) random branch and access patterns are sufficient to make both fail (the vast majority of the time).

This is all deliberately very general, so as to be basically true on all recent, high-performance CPUs.

Finally, I noted that a timing side-channel may experience random delays, the length of which can turn a fast result into a slow result. This is unavoidable, and depends on other system activity, so error detection and correction is required. The probabilistic techniques I use can lead to false positives and false negatives, but as long as these are reasonably improbable, error-correction allows us to turn this into a performance problem rather than a correctness problem.

The two main techniques here were based on, and inspired by, a 2013 blog post by Henry Wong, Measuring Reorder Buffer Capacity.

Pigeonhole Eviction

This idea is as self-evident as the pigeonhole principle, but nonetheless it took me a while to internalise, and to think of applying it to Spectre.

If you have N cache-lines of data, and a cache that can hold only M lines, and N > M, then some of your data (at least N-M lines) must have been evicted from the cache.

For example, consider a CPU with a 256 KB L2 cache. By allocating 64MB of data, there is at most a 0.4% (1/256) chance that a randomly chosen cache-line will be in the L2 cache, and a much smaller chance it’ll be in the (smaller) L1 cache.

This is an adequate eviction “technique” for Spectre. I never evict a chosen line, but instead choose a new location at random that is very likely to have already been evicted. (And at any time a line can be moved into the L1 cache by reading or writing that cache line.)

MLP Amplification

Consider the following loop, which follows a pointer-chain. If p always points to a random cache-line within an uncacheably-large buffer, it will run at approximately one cache-miss per iteration.

 loop {
    p = buffer[p]

Now, supposing I add another cache-miss, where buffer[p+1] also points to a random cache-line:

 loop {
    p = buffer[p]
    sum += buffer[buffer[p+1]]

This will still run at at approximately one cache-miss per iteration, because the two cache-misses don’t depend on each other and can take advantage of memory-level parallelism.

Next, imagine that buffer[p+1] doesn’t point to a random location, but instead points to a location a few steps ahead on the pointer-chain. I’ll call this a prefetching load. Each step to a random location will cost approximately one cache-miss, but on any iteration where the line has been fetched into the cache already, there won’t be a cache miss, so the average will be less than one cache-miss per iteration. These two different iteration speeds, depending on a value, are the basis of the side-channel.

The trick is then to make the prefetching load speculative. A bit that you want to leak from speculative execution can be multiplied by a large number and added to buffer[p+1], and the speed of the loop will now be decided by the value of the bit.

This is done by making the loop mispredict an internal branch (e.g. a bounds check) every few (e.g. five to fifteen) iterations, read some an integer it shouldn’t, extract a bit, and, if that value is zero, preload the upcoming links in the pointer chain. Memory level parallelism is typically much wider than two, so my PoC uses five prefetching loads which can significantly shorten the non-triggering iterations.

With this technique, amplification is arbitrary, and controlled by the number of iterations of the loop. However, the strength of the signal may vary depending on the microarchitecture.

The code for this is complicated by the need to run a random number of branch-predictor-training iterations between every speculative read (as the branch may predict correctly if the number is non-random). This also has the limitation that speculative reads that only work with low-probability have a much weaker signal, which may not be distinguishable from noise (i.e. this technique doesn’t allow the speculative read operation to be timed independently of the cache access operation.)

For timing, the simplest portable technique I found was to leak each bit followed by the inverse of that bit, and compare the two times. This halves the speed of the leak, doesn’t correct for all errors, and isn’t usually strictly necessary, but it does allows leaking at low signal strengths, where dynamic CPU clock frequencies might shift one value to start looking like the other during the run. There’s a lot of room for improvement, but it’s a nice way to skip implementing a calibration phase if you’re lazy.

(Note that amplification is not necessary on browsers with SharedArrayBuffer enabled.)

Spectre Gadgets

My initial efforts used an Array out-of-bounds, storing a cache-missing new length to an array, so as to delay resolution of the bounds-check. The Javascript code worked to read out-of-bounds on different CPUs, and on both Chrome and Safari (although the different Array representations in Chrome and Safari made the out-of-bounds read almost useless in Safari). But this was relatively slow, as the store forwarding of the length would sometimes stall speculative execution (or at least that’s my best guess), lowering the signal strength.

Type-confusion (map check bypass) techniques seem to be the most practical avenue. The variants I used are specific to a browser, but it’s essentially a “choose your own type confusion”, using traditional Javascript exploitation techniques, first reading out-of-bounds, then using that information to set up the data for a fake object that allows an arbitrary read.

Very roughly, I create an object which stores its type field a couple of cache-lines away from a target-field. I create a lot of adjacent objects, of a smaller, different type. Two smaller objects will be interpreted as the larger object’s type and target-field, respectively. I rely on pigeonhole eviction to get a randomly chosen target pair that are uncached, then access the target-field-containing smaller object to ensure it’s in the cache. After this, resolution of the speculative type-check will be delayed, and branch-prediction will assume it passes. However, the speculative read of the target-field will not be delayed. This general idea works on both Chrome and Safari, although the details differ. There are fallible steps here, but speculatively reading unmapped memory won’t crash, error-correction can handle rare flipped bits, and amplification automatically takes an average of a large number of attempts.

Last I checked, Firefox tries not to speculate past type-checks (on JS objects), as well as bounds-checks on arrays, but does not mitigate against variant 4 attacks. I was initially rather sceptical of the practicality of variant 4 attacks described in the V8 paper. Specifically, it’s easy to get a mov [rax], rbx ; mov rcx, [rdx] to use a stale value from memory (where rax == rdx, but the value of rdx is available first), but it’s not immediately clear how this can be exploited. It’s much harder to get the more common mov [rsp+0x8], rbx ; mov rcx, [rsp+0x8] to use stale memory. (This, for example, is a pattern Firefox can generate with one instruction in-between, when spilling a value to the stack, which would be rather exploitable). However, in a standalone harness I found that a Coffee Lake processor could do this with >30% reliability depending on the preceding instructions. This appeared to be under a complex set of microarchitectural conditions, which aren’t particularly hard to hit occasionally, but are very tricky to get the processor to repeatably hit with high probability. I agree with their conclusion that this is a real threat, and essentially unable to be mitigated, although I got distracted trying to understand the microarchitectural state, and didn’t try to write a proof-of-concept.

Coding for Side-Channels

Spectre allows us to get data by asking short, stateless questions, and getting a yes-or-no answer back over a possibly-unreliable channel. This is a similar situation to other blind injection vulnerabilities (such as SQLi). But what are the best questions to ask? This is an interesting mathematical problem, closely related to coding theory.

Some coding theory solutions can be applied directly. For example, dynamic Huffman coding can be used to compress the data (and get a result with fewer questions), which I’ve done for blind SQL injection, and should apply here too. Similarly, Hamming codes and repetition codes can be used for error detection and correction, which I tried for Spectre. However, these solutions are unlikely to be optimal, as the reliable and flexible one-way question channel is an unusual advantage.

A Chrome arbitrary-read Spectre proof-of-concept. This does not need or use amplification, but uses hacky, slow error-correction at about 19% efficiency (and capstone.js to disassemble JIT-compiled code, just for fun). It runs as-is on a 2013 Haswell, a 2017 Kaby Lake, a 2019 Coffee Lake, and an Apple M1 Firestorm.

In my standalone experiments, one of the best (and simplest) reliability techniques I’ve found is “ask and verify” (which I first saw in sqlmap). The idea is that, after leaking a word of the message bit-by-bit, we then ask if the word is equal to what we leaked, and if not, we retry. Less reliable channels benefit from repeating the verification a number of times, and also asking for its inverse (so as to detect “always yes” or “always no” burst noise conditions). They also benefit from smaller chunk sizes. More reliable channels can decrease overhead by asking the verifying question for larger chunks of the message. There’s some fun maths here, but I haven’t taken the time to explore it. Maybe it’s already been written about somewhere?

Final Notes

I don’t know if any of this work is novel. I tried to reproduce the results in the two-year-old V8 paper Spectre is here to stay, where they described proof-of-concepts like this. They left out the proof-of-concepts and specific techniques, presumably because they felt that publishing them would do more harm than good. I filled in the gaps, so maybe my solutions are the same. Maybe they aren’t. It feels kinda weird to use an old paper by the V8 developers to write a V8 exploit that works today, but I guess that’s the post-Spectre world.

Similarly, the recent publication of A Spectre proof-of-concept for a Spectre-proof web by Stephen Röttger and Artur Janc on the Google Security blog would be have been unimaginable a few years ago (as their PoC not only bypasses ASLR but allows the contents of memory to be read from Javascript in the latest Chrome). Their work is much more impressive than mine, with techniques that allow exploitation in much less flexible situations, and are also quite portable (given the prevalence of L1 Tree-PLRU caches). I hope to experiment with some of their techniques in the future.

They state “we don’t believe this particular PoC can be re-used for nefarious purposes without significant modifications”. I don’t plan to publish my proof-of-concepts, but I hope the result that arbitrarily-amplifiable Spectre PoCs needn’t be specific to a cache-design, microarchitecture, architecture, or technically even browser is useful to understanding this uniquely persistent vulnerability.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s