Out-of-order processors have to keep track of multiple in-flight operations at once, and they use a variety of different buffers and queues to do so. I’ve been trying to characterise and measure some of these buffers in the Apple M1 processor’s Firestorm and Icestorm microarchitectures, by timing different instruction patterns.
I’ve measured the size of the load and store queue, discovered that load and store queue entries are allocated when the ops issue from the scheduler, and released once they are non-speculative and all earlier loads and stores have been released. I may have also accidentally found a trick for manipulating load/store alias prediction. And I figured I should write it up, so other people can reproduce it, and/or find mistakes.
Fighting Loads with Loads
The general idea is to time the execution of two independent long-latency operations (instructions, or chains of dependent instructions), with a number of loads or stores between them. Usually these two long-latency operations can run in parallel, but if there are so many loads/stores that some buffer is completely filled, the machine will stall, and will have to wait until space in the buffer is freed to execute subsequent instructions. This method was described first (and much more clearly) in Henry Wong’s blog post Measuring Reorder Buffer Capacity.
Initially, in measuring the M1, I used cache-missing loads as the “two independent long-latency operations” (as did everyone else, to my knowledge).
My result (which very roughly looked like AnandTech’s results) was that 128 loads or 107 stores could run, without stopping the final long-latency load from executing in parallel, but adding one more would cause a stall. Since the first load, 128 other loads, and the last load are in flight at the same time, I’d call this a load queue size of 130. I still believe this to be correct. On the other hand, the loads don’t require store buffers, so I incorrectly called this a 107 entry store queue.
Schedulers and Dispatch Queues
Although it is not the topic of this post, I also mapped out the structure of the dispatch queues and schedulers:
If an op can enter a scheduler, it can leave the scheduler and execute when all its dependencies are ready. If it can enter a dispatch queue, then it will enter a scheduler when there’s room, and until then, the frontend can continue. But if the scheduler and dispatch queue are full, and we have an op that needs to go to that queue, we stall. This means no more ops can enter any dispatch queues until there is room in that queue.
It is worth noting that the load/store scheduler has 48 entries, with a 10 entry dispatch queue.
(Very briefly: These were measured by filling the scheduler with some number of loads or stores, with addresses depending on the first long-latency chain, then finding two points. First, I found the last point at which an independent load could still fit into the scheduler and run, to find the scheduler size. Then, I found the number of extra load/stores, once the scheduler was full, that were required to block independent floating-point operations. That gives the dispatch queue size. Mixing operations makes it possible to figure out which schedulers/queues are separate and which are not. Should I write this up?)
Fighting Stores with Square-Roots
The next idea was to use a chain of long-latency floating point operations instead. Surprisingly, this produces a result that about 329 loads or stores can run between floating point operations, without forcing the two long-latency chains to run completely in parallel. I assume this means that load and store queue entries are being released, and reused, before the first long-latency chain has retired, and we’re hitting another limit. Mixing loads and stores confirms it’s the same limit. (It’s not the topic of this post, but I’ve explored it a bit and called it the “coalescing retire queue”. I believe it is their ROB implementation.)
So at this point I’m guessing that loads and stores can complete (free their load/store queue entries, but not their retire queue entries) once they are non-speculative. I believe this completion is in-order relative to other loads and stores. The approach I used to test this was to have a single, initial load or store operation, with its address depending on the result of the first long-latency operation.
However, for this writeup, I got the same result by instead adding a branch, dependent on the result of the first long-latency operation. This will ensure the loads and stores cannot become non-speculative, and keep their queue entries. In this test we see we can run 188 loads or 118 stores without forcing the two long-latency chains to run completely in parallel. This was initially pretty confusing, since we believe we only have a 130 entry load buffer. So, where did the extra 58 entries come from?
The load/store scheduler has 48 entries, with a 10 entry dispatch queue. If the load/store scheduler and dispatch queue are full, the integer and floating point units can continue operating. But if we hit one more load/store, the machine stalls, as it has nowhere to put the instruction. This explains the 58 extra entries.
By this logic (subtracting 58 for the size of the scheduler and dispatch queue), the store queue has only 60 entries. So why did we think it had 47 more? Because if the 48 entry scheduler is almost full, but has one free entry, then a load can enter the scheduler, and then issue from the scheduler and execute (in parallel with the other long-latency load), but if the scheduler is completely full, it cannot.
So those are my current numbers, 130 load queue entries and 60 store queue entries. The same logic works for Icestorm, where we see 30 load queue entries and 18 store queue entries (with an 18 entry scheduler and a 6 entry dispatch queue).
An Uncomfortable, Surprising Discovery
In writing this up for this post, I tried to reproduce the “107 entry store queue” result, but the result I got was 60 entries. This is both the best answer I could hope for (it’s the number I currently think is correct), and the worst possible outcome (I’m writing this up, so that people can reproduce my work, and I’m failing to reproduce my own results).
So what went wrong? Bisecting a little, I found this was caused by refactoring to use X29 as the base address for the second long-latency load (with a variable offset), and using X29 or SP as the base address for the store (with a constant offset). Changing either or both registers back to X3 (even though the values did not change) gave me 107 again.
Processors try to execute loads and stores out of order, which makes things fast, as long as the data read by a load isn’t later changed by a preceding (in program order) store. This is called a memory order violation. When it happens, a processor typically throws away all its work and starts again, which is very expensive. (Apple provides a performance counter for this. It’s called MEMORY_ORDER_VIOLATION and described as “Incorrect speculation between store and dependent load”.) Because this is so expensive, there are predictors that try to figure out when a load and a store might alias, and run them in order instead. You can read more about how Intel approaches this in Travis Downs’ Memory Disambiguation on Skylake writeup.
X29 is typically used as the stack frame base pointer, and I suspect the memory dependency prediction has a special case for this, and makes the load wait until it knows the store doesn’t alias it. The theory is that if we have 60 speculative stores before the load, we can figure out all their addresses, and that the load can go ahead. But if we have 61, we can’t check the last one, so the load will wait.
The same code running on Icestorm also measures the 18 entry store buffer.
I think this explanation makes sense, but it was a surprising reminder that there’s still a lot of mysteries here, and it’d be good to verify it further. I put the code to reproduce this result in a gist if others want to investigate this.
So, here’s the data. To get a single number, I find the top of the jump (the first point at which it’s executing completely serially) and subtract one. This is the largest number of instructions that execute a measurable amount faster than completely serially. But however you pick, the results are close enough.
(Note that the dependent B.cc is chained using FCMP, as I think FMOV might interfere with the memory operations.)
The difference between measuring using the resource itself (loads+loads), and measuring using another resource (fsqrts+loads) is very clear in both graphs. The 58 instruction difference implies that when we do not have the resources to execute more loads, we can continue move more loads into the scheduler. So I conclude that the resources are allocated late. Similarly, the ~330 limit we can hit implies that this resource can be freed early.
I do not see this kind of pattern when measuring physical register file sizes (e.g. comparing int+int vs int+float), so I believe they are neither allocated late, nor freed early. But there’s a lot of complexity I do not yet understand.