Rosetta 2 is remarkably fast when compared to other x86-on-ARM emulators. I’ve spent a little time looking at how it works, out of idle curiosity, and found it to be quite unusual, so I figured I’d put together my notes.
My understanding is a bit rough, and is mostly based on reading the ahead-of-time translated code, and making inferences about the runtime from that. Let me know if you have any corrections, or find any tricks I’ve missed.
Rosetta 2 translates the entire text segment of the binary from x86 to ARM up-front, avoiding both the direct runtime cost of compilation and any indirect instruction- and data-cache effects. It also supports just-in-time (JIT) translation, but that is used relatively rarely.
Other emulators typically translate code lazily, in execution order, which can allow faster startup times, but doesn’t preserve code locality.
Usually one-to-one translation
[Correction: an earlier version of this post said that every ahead-of-time translated instruction was a valid entry point. While I still believe it would be valid to jump to almost any ahead-of-time translated instruction, the lookup tables used do not allow for this. I believe this is an optimisation to keep the lookup size small. The prologue/epilogue optimisation was also discovered after the initial version of this post.]
Each x86 instruction is translated to one or more ARM instructions once within the ahead-of-time binary (with the exception of NOPs, which are ignored). When an indirect jump or call sets the instruction pointer to an arbitrary offset in the text segment, the runtime will look up the corresponding translated instruction, and branch there.
This uses an x86 to ARM lookup table that contains all function starts, and other basic blocks that are otherwise not referenced. If it misses this, for example while handling a switch-statement, it can fall back to the JIT.
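As a minimal sketch of this dispatch path (the table contents, cache, and JIT fallback below are all illustrative, not Rosetta 2’s actual data structures):

```python
import bisect

# Hypothetical sorted table mapping x86 addresses of function starts and
# otherwise-unreferenced basic blocks to their translated ARM addresses.
AOT_TABLE = [(0x1000, 0x4000), (0x1020, 0x4080), (0x1100, 0x4200)]
X86_KEYS = [x86 for x86, _ in AOT_TABLE]

branch_cache = {}  # memoises previously resolved branch targets


def jit_translate(x86_addr):
    """Stand-in for falling back to the JIT when the table misses."""
    return 0x8000 + x86_addr  # arbitrary placeholder address


def resolve_indirect_branch(x86_addr):
    if x86_addr in branch_cache:
        return branch_cache[x86_addr]
    i = bisect.bisect_left(X86_KEYS, x86_addr)
    if i < len(X86_KEYS) and X86_KEYS[i] == x86_addr:
        arm = AOT_TABLE[i][1]          # hit in the ahead-of-time table
    else:
        arm = jit_translate(x86_addr)  # e.g. a switch-statement target
    branch_cache[x86_addr] = arm
    return arm
```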
To allow for precise exception handling, sampling profiling, and attaching debuggers, Rosetta 2 maintains a full mapping from translated ARM instructions to their original x86 address, and guarantees that the state will be canonical between each instruction.
This almost entirely prevents inter-instruction optimisations. There are two known exceptions. The first is an “unused-flags” optimisation, which avoids calculating x86 flag values when they are overwritten before being used on every path from a flag-setting instruction. The other combines function prologues and epilogues, pairing push and pop instructions and delaying updates of the stack pointer. In the ARM to x86 address mapping, the combined instructions appear as though they were a single instruction.
There are some tradeoffs to keeping state canonical between each instruction:
- Either all emulated register values must be kept in host registers, or you need load or store instructions every time certain registers are used. 64-bit x86 has half as many registers as 64-bit ARM, so this isn’t a problem for Rosetta 2, but it would be a significant drawback to this technique for emulating 64-bit ARM on x86, or PPC on 64-bit ARM.
- There are very few inter-instruction optimisations, leading to surprisingly poor code generation in some cases. However, in general, the code has already been generated by an optimising compiler for x86, so the benefits of many optimisations would be limited.
However there are significant benefits:
- Generally translating each instruction only once has significant instruction-cache benefits – other emulators typically cannot reuse code when branching to a new target.
- Precise exceptions, without needing to store much more information than instruction boundaries.
- The debugger (LLDB) works with x86 binaries, and can attach to Rosetta 2 processes.
- Having fewer optimisations simplifies code generation, making translation faster. Translation speed is important for both first-start time (where tens of megabytes of code may be translated), and JIT translation time, which is critical to the performance of applications that use JIT compilers.
Optimising for the instruction-cache might not seem like a significant benefit, but it typically is in emulators, as there’s already an expansion-factor when translating between instruction sets. Every one-byte x86 push becomes a four byte ARM instruction, and every read-modify-write x86 instruction is three ARM instructions (or more, depending on addressing mode). And that’s if the perfect instruction is available. When the instructions have slightly different semantics, even more instructions are needed to get the required behaviour.
Given these constraints, the goal is generally to get as close to one-ARM-instruction-per-x86-instruction as possible, and the tricks described in the following sections allow Rosetta to achieve this surprisingly often. This keeps the expansion-factor as low as possible. For example, the instruction size expansion factor for an sqlite3 binary is ~1.64x (1.05MB of x86 instructions vs 1.72MB of ARM instructions).
(The two lookups (one from x86 to ARM and the other from ARM to x86) are reached via the fragment list in LC_AOT_METADATA. Branch-target results are cached in a hash-map. Various structures can be used for these, but in one binary the performance-critical x86 to ARM mapping used a two-level binary search, and the much larger, less-performance-critical ARM to x86 mapping used a top-level binary search, followed by a linear scan through bit-packed data.)
An ADRP instruction followed by an ADD is used to emulate x86’s RIP-relative addressing. This is limited to a +/-4GB range. Rosetta 2 places the translated binary after the non-translated binary in memory, so you roughly have [untranslated code][data][translated code][runtime support code]. This means that ADRP can reference data and untranslated code as needed. Loading the runtime support functions immediately after the translated code also allows translated code to make direct calls into the runtime.
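To sketch the addressing arithmetic (instruction encodings omitted): ADRP materialises the 4KiB-page base of a PC-relative target from a signed 21-bit page delta, and the following ADD supplies the low 12 bits. A small model:

```python
PAGE = 4096  # ADRP works in units of 4 KiB pages


def encode_adrp_add(pc, target):
    """Split a PC-relative target into the (page delta, low 12 bits)
    pair that an ADRP/ADD sequence encodes. The signed 21-bit page
    delta is what limits the reach of this addressing mode."""
    page_delta = (target // PAGE) - (pc // PAGE)
    assert -(1 << 20) <= page_delta < (1 << 20), "target out of ADRP range"
    return page_delta, target % PAGE


def decode_adrp_add(pc, page_delta, lo12):
    """What the ADRP (page base) plus ADD (low bits) pair computes."""
    return ((pc // PAGE) + page_delta) * PAGE + lo12
```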
Return address prediction
All performant processors have a return-address-stack to allow branch prediction to correctly predict return instructions.
Rosetta 2 takes advantage of this by rewriting x86 CALL and RET instructions to ARM BL and RET instructions (as well as the architectural loads/stores and stack-pointer adjustments). This also requires some extra book-keeping, saving the expected x86 return-address and the corresponding translated jump target on a special stack when calling, and validating them when returning, but it allows for correct return prediction.
This trick is also used in the GameCube/Wii emulator Dolphin.
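A sketch of that bookkeeping (the names and slow-path are mine, mirroring the description above rather than Rosetta 2’s actual implementation): on CALL, push the expected x86 return address and its translated counterpart onto a shadow stack; on RET, pop and validate, falling back to a full lookup on mismatch.

```python
shadow_stack = []  # (x86 return address, translated ARM return address)


def slow_lookup(x86_addr):
    """Stand-in for the runtime's general x86-to-ARM address lookup."""
    return 0xDEAD0000 | x86_addr


def emulate_call(x86_return_addr, arm_return_addr):
    # In the real translation this is a BL, so the hardware
    # return-address stack predicts the matching RET.
    shadow_stack.append((x86_return_addr, arm_return_addr))


def emulate_ret(x86_return_addr):
    if shadow_stack:
        expected_x86, arm = shadow_stack.pop()
        if expected_x86 == x86_return_addr:
            return arm  # fast path: prediction-friendly RET
    # Mismatch (e.g. the guest rewrote its return address): fall back.
    return slow_lookup(x86_return_addr)
```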
ARM flag-manipulation extensions
A lot of overhead comes from small differences in behaviour between x86 and ARM, like the semantics of flags. Rosetta 2 uses the ARM flag-manipulation extensions (FEAT_FlagM and FEAT_FlagM2) to handle these differences efficiently.
For example, x86 uses “subtract-with-borrow”, whereas ARM uses “subtract-with-carry”. This effectively inverts the carry flag when doing a subtraction, as opposed to when doing an addition. As CMP is a flag-setting subtraction without a result, it’s much more common to use the flags from a subtraction than an addition, so Rosetta 2 chooses inverted as the canonical form of the carry flag. The CFINV instruction (carry-flag-invert) is used to invert the carry after any ADD operation where the carry flag is used or may escape (and to rectify the carry flag, when it’s the input to an add-with-carry instruction).
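To make the difference concrete, here’s a small model (illustrative only) of the two conventions for a flag-setting subtraction: x86 sets CF when a borrow occurs, while ARM sets C when no borrow occurs, so for SUB/CMP the two flags are always inverses of each other.

```python
def sub_flags(a, b, bits=8):
    """Model a flag-setting subtract under both carry conventions."""
    borrow = a < b
    result = (a - b) % (1 << bits)
    x86_cf = borrow      # x86 SUB/CMP: CF set when a borrow occurs
    arm_c = not borrow   # ARM SUBS/CMP: C set when no borrow occurs
    return result, x86_cf, arm_c
```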
x86 shift instructions also require complicated flag handling, as they shift bits into the carry flag. The RMIF instruction (rotate-mask-insert-flags) is used within Rosetta 2 to move an arbitrary bit from a register into an arbitrary flag, which makes emulating fixed shifts (among other things) relatively efficient. Variable shifts remain relatively inefficient if flags escape, as the flags must not be modified when shifting by zero, requiring a conditional branch.
Unlike x86, ARM doesn’t have any 8-bit or 16-bit operations. These are generally easy to emulate with wider operations (which is how compilers implement operations on these values), with the small catch that x86 requires preserving the original high-bits. However, the SETF8 and SETF16 instructions help to emulate the flag-setting behaviour of these narrower instructions.
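A sketch (illustrative) of emulating an 8-bit x86 add with a full-width operation: compute in 32 bits, merge only the low byte back into the destination, and derive N and Z from the low 8 bits of the result, which is the job SETF8 does in a single instruction.

```python
def emulate_add8(dst, src):
    """8-bit add into the low byte of dst, preserving the high bits,
    as x86's `add al, bl` would."""
    low = ((dst & 0xFF) + (src & 0xFF)) & 0xFF
    result = (dst & ~0xFF) | low
    # SETF8 derives N and Z from the low 8 bits of a full-width value:
    n = bool(low & 0x80)
    z = low == 0
    return result, n, z
```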
Those were all from FEAT_FlagM. The instructions from FEAT_FlagM2 are AXFLAG and XAFLAG, which convert floating-point condition flags to/from a mysterious “external format”. By some strange coincidence, this format is x86, so these instructions are used when dealing with floating-point flags.
x86 and ARM both implement IEEE-754, so the most common floating-point operations are almost identical. One exception is the handling of the different possible bit patterns underlying NaN values, and another is whether tininess is detected before or after rounding. Most applications won’t mind if you get this wrong, but some will, and to get it right would require expensive checks on every floating-point operation. Fortunately, this is handled in hardware.
There’s a standard ARM alternate floating-point behaviour extension (FEAT_AFP) from ARMv8.7, but the M1 design predates the v8.7 standard, so Rosetta 2 uses a non-standard implementation.
Total store ordering (TSO)
One non-standard ARM extension available on the Apple M1 that has been widely publicised is hardware support for TSO (total-store-ordering), which, when enabled, gives regular ARM load-and-store instructions the same ordering guarantees that loads and stores have on an x86 system.
As far as I know this is not part of the ARM standard, but it also isn’t Apple specific: Nvidia Denver/Carmel and Fujitsu A64fx are other 64-bit ARM processors that also implement TSO (thanks to marcan for these details).
Apple’s secret extension
There are only a handful of different instructions that account for 90% of all operations executed, and, near the top of that list are addition and subtraction. On ARM these can optionally set the four-bit NZCV register, whereas on x86 they always set six flag bits: CF, ZF, SF and OF (which correspond well enough to NZCV), as well as PF (the parity flag) and AF (the adjust flag).
Emulating the last two in software is possible (and seems to be supported by Rosetta 2 for Linux), but can be rather expensive. Most software won’t notice if you get these wrong, but some software will. The Apple M1 has an undocumented extension that, when enabled, ensures instructions like ADDS, SUBS and CMP compute PF and AF and store them as bits 26 and 27 of NZCV respectively, providing accurate emulation with no performance penalty.
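For reference, here’s a software model (illustrative) of the two flags the extension computes in hardware, and how they’d pack into the NZCV bits described above:

```python
def parity_flag(result):
    # PF: set if the low 8 bits contain an even number of set bits.
    return bin(result & 0xFF).count("1") % 2 == 0


def adjust_flag(a, b):
    # AF: carry out of bit 3 in a + b (used for BCD arithmetic).
    return ((a & 0xF) + (b & 0xF)) > 0xF


def pack_nzcv_extra(a, b):
    """Pack PF and AF into bits 26 and 27, as the extension does."""
    result = a + b
    return (parity_flag(result) << 26) | (adjust_flag(a, b) << 27)
```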
Ultimately, the M1 is incredibly fast. By being so much wider than comparable x86 CPUs, it has a remarkable ability to avoid being throughput-bound, even with all the extra instructions Rosetta 2 generates. In some cases (iirc, IDA Pro) there really isn’t much of a speedup going from Rosetta 2 to native ARM.
I believe there’s significant room for performance improvement in Rosetta 2, by using static analysis to find possible branch targets, and performing inter-instruction optimisations between them. However, this would come at the cost of significantly increased complexity (especially for debugging), increased translation times, and less predictable performance (as it’d have to fall back to JIT translation when the static analysis is incorrect).
Engineering is about making the right tradeoffs, and I’d say Rosetta 2 has done exactly that. While other emulators might require inter-instruction optimisations for performance, Rosetta 2 is able to trust a fast CPU, generate code that respects its caches and predictors, and solve the messiest problems in hardware.
You can follow me at @firstname.lastname@example.org.
Update: SSE2 support
After reading some comments I realised this was a significant omission from the original post. Rosetta 2 provides full emulation for the SSE2 SIMD instruction set. These instructions have been enabled in compilers by default for many years, so this would have been required for compatibility. It isn’t just a compatibility fallback, though: all common operations are translated to reasonably-optimised sequences of NEON operations, which is critical to the performance of software that has been optimised to use these instructions.
Many emulators also use this SIMD to SIMD translation approach, but others use SIMD to scalar translation, or call out to runtime support functions for each SIMD operation.
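As an illustration of the SIMD-to-SIMD approach, here’s a scalar model of SSE2’s PADDB (sixteen independent byte adds with wraparound), which maps directly onto a single NEON instruction (`add v0.16b, v1.16b, v2.16b`):

```python
def paddb(a: bytes, b: bytes) -> bytes:
    """Lane-wise 8-bit add with wraparound, modelling what SSE2 PADDB
    and NEON's 16-byte vector add perform on a 128-bit register."""
    assert len(a) == len(b) == 16
    return bytes((x + y) & 0xFF for x, y in zip(a, b))
```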
Update: Unused flag optimisation
This is one of two inter-instruction optimisations mentioned earlier, but it deserves its own section. Rosetta 2 avoids computing flags when they are unused and don’t escape. This means that even with the flag-manipulation instructions, the vast majority of flag-setting x86 instructions can be translated to non-flag-setting ARM instructions, with no fix-up required, significantly reducing both instruction count and code size.
It’s even more valuable for Rosetta 2 in Linux VMs. In VMs, Rosetta is unable to enable the Apple parity-flag extension, and instead computes the value manually. (It may or may not also compute the adjust-flag). This is relatively expensive, so it’s very valuable to avoid it.
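A sketch of the analysis (my own formulation, not Rosetta 2’s): for each flag-setting instruction, its flags only need to be computed if some later instruction can read them before they are overwritten. For straight-line code this is a simple backwards liveness scan:

```python
# Each instruction is modelled as (name, reads_flags, writes_flags).
def flags_needed(instructions):
    """Backwards liveness scan over straight-line code: returns, for
    each instruction, whether its flag results can ever be observed."""
    live = True  # conservatively assume flags escape at the block's end
    needed = [False] * len(instructions)
    for i in range(len(instructions) - 1, -1, -1):
        name, reads, writes = instructions[i]
        if writes:
            needed[i] = live  # flags matter only if a later read sees them
            live = False      # earlier flag values are now dead
        if reads:
            live = True
    return needed


block = [
    ("add", False, True),   # flags overwritten by `sub` before any read
    ("sub", False, True),   # flags read by `jcc`: must be computed
    ("jcc", True, False),
]
```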
Update: Prologue/epilogue combining
The last inter-instruction optimisation I’m aware of is prologue/epilogue combining. Rosetta 2 finds groups of instructions that set up or tear down stack frames, and merges them, pairing loads and stores, and delaying stack-pointer updates. This is equivalent to the “stack engine” found in hardware implementations of x86.
For example, the following prologue:
push rbp
mov rbp, rsp
push rbx
push rax

is translated to:

stur x5, [x4, #-8]
sub x5, x4, #8
stp x0, x3, [x4, #-0x18]!
This greatly reduces the number of loads, stores, and arithmetic instructions that modify the stack-pointer, improving performance and code size. These paired loads and stores execute as a single operation on the Apple M1, which, as far as I know, isn’t possible on x86 CPUs, giving Rosetta 2 an advantage here.
Appendix: Research Method
This exploration was based on the methods and information described in Koh M. Nakagawa’s excellent Project Champollion.
To see ahead-of-time translated Rosetta code, I believe I had to disable SIP, compile a new x86 binary, give it a unique name, run it, and then run otool -tv /var/db/oah/*/*/unique-name.aot (or use your tool of choice – it’s just a Mach-O binary). This was done on an old version of macOS, so things may have changed and improved since then.
Update: Appendix: Compatibility
Although it has nothing to do with why Rosetta 2 is fast, there are a couple of impressive compatibility features that seem worth mentioning.
Rosetta 2 has a full, slow, software implementation of x87’s 80-bit floating point numbers. This allows software that uses those instructions to run correctly. Windows on Arm approaches this by using 64-bit float operations, which generally works, but the decreased precision causes problems in rare cases. Most software either doesn’t use x87, or was designed to run on older hardware, so even though this emulation is slow, the performance typically works out.
[Update: An earlier version of this post stated that I didn’t believe Windows on Arm handled x87. Thanks to kode54 for correcting this in a comment.]
Rosetta 2 also apparently supports the full 32-bit instruction set for Wine. Support for native 32-bit macOS applications was dropped prior to the launch of Apple Silicon, but support for the 32-bit x86 instruction set allegedly lives on. (I haven’t investigated this myself.)
Windows for ARM does support x87 instructions, but doesn’t appear to use a full soft floating point implementation, instead opting for a faster but less accurate path of translating directly to scalar ARM math, which has at most 64-bit precision. This is good enough in most cases anyway, as most Windows C runtimes for the last 10+ years have defaulted to 64-bit precision, only switching to 32- or 80-bit precision if an application intentionally meddled with the FPU state.
Updated, thanks for the correction!
“Each x86 instruction is translated to one or more ARM instructions exactly once. When an indirect jump or call sets the instruction pointer to an arbitrary offset in the text segment, the runtime will look up the corresponding translated instruction, and branch there.”
So are you saying it maintains a table of all x86 ops and the address of their corresponding ARM64 ops? That sounds like it would have a hefty RAM footprint (and be D-cache-intensive) compared to the usual “just track branch targets” approach.
Yeah, I was incorrect about this, and I’ve updated that section to reflect my current understanding.
However, there is a full mapping in the other direction (ARM PC to x86 address). This is split into two levels, a top level binary search, and then second-level bit-packed delta-encoded entries for each instruction in the group.
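A toy model of the general shape of such a structure (my guess at the layout, not the real format): a sorted top level of (ARM start, x86 start) group headers found by binary search, then small per-instruction deltas scanned linearly within a group. The deltas are small, which is what makes them bit-pack well.

```python
import bisect

# Each group: (arm_start, x86_start, [(arm_delta, x86_delta), ...])
# with one delta pair per instruction in the group.
GROUPS = [
    (0x4000, 0x1000, [(4, 1), (4, 3), (8, 2)]),
    (0x4200, 0x1100, [(4, 5), (4, 1)]),
]
ARM_STARTS = [g[0] for g in GROUPS]


def arm_to_x86(arm_pc):
    i = bisect.bisect_right(ARM_STARTS, arm_pc) - 1  # top-level binary search
    arm, x86, deltas = GROUPS[i]
    for arm_delta, x86_delta in deltas:              # second-level linear scan
        if arm == arm_pc:
            return x86
        arm += arm_delta
        x86 += x86_delta
    return x86 if arm == arm_pc else None
```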
Could you expand a bit on the M1’s undocumented feature enabling the computation of PF and AF and what that particular win looks like? Thanks.
I replied to this on Mastodon first, but for the benefit of others:
Both the “adjust flag” (AF) and the “parity flag” (PF) come from the 8080-family CPUs from the 1970s. Today they’re almost completely unused. The parity flag is set if, in the last eight bits of the result, the number of set bits is even. Otherwise, it’s cleared. The adjust flag is set if there’s a “carry out” from the low four bits of the addition (and otherwise cleared). This was used for binary-coded decimal – that flag would indicate a carry from one 4-bit digit to the next.
Both these flags are computed on every ADD or SUB instruction (extremely often), and 64-bit ARM has no such functionality. For example, to compute the parity flag on ARM, a subtraction turns into something like:
subs w4, w4, w5
dup v24.16b, w4
cnt v23.16b, v24.16b
umov w22, v23.b[0]
and w22, w22, #1
That’s a lot of work for one subtraction (5x as many instructions), and that’s not all of it – we didn’t compute AF.
But it’s not really a lot of work for Intel – in every x86 CPU, they just build some logic that computes both AF and PF at the same time as doing the subtraction. So, since Apple design their own CPUs, they decided to also build this logic. When running Rosetta 2, the CPU is configured to enable this functionality (since it’d break the ARM specification to have it enabled all the time). With that set Rosetta 2 does:
subs w4, w4, w5
And both flags are computed for it by the hardware.
Rosetta 2 can also run in Linux VMs on Apple Silicon. In a VM, it isn’t able to configure the host CPU, so it can’t use this functionality. There are two other options. Either you can skip computing the flags, because they’re mostly useless and most software won’t care, or you can compute them the long way shown above. Rosetta 2 chooses the second option, and this mostly works out fine, because they have an “unused flags” optimisation that avoids the computation a lot of the time.
Rosetta 2 worked very nicely for R, the statistical computing environment, which relies quite heavily on the details of floating point, including the use of NaN payloads to represent ‘missing data’. Actually compiling for the M1 took quite a bit of work, but Rosetta 2 was fast enough that it wasn’t too urgent.