The description of the RESOURCE_STALLS.RS
hardware performance event for Intel Broadwell is the following:
This event counts stall cycles
I've written a program that can be used to explore undocumented limitations of the RS in Intel processors in the hope that I'll be able to eventually answer the question. The basic idea is to make sure that the RS is completely empty before allocating and executing a specific sequence of uops in a loop. The RESOURCE_STALLS.RS
can be used to determine whether that sequence has hit a limitation in the RS itself. For example, if RESOURCE_STALLS.RS
is 1 per iteration, then the allocator had to stall for one cycle to allocate RS entries for all uops in the sequence. If RESOURCE_STALLS.RS
is much smaller than 1 per iteration, then it basically did not have to stall and so we know we didn't not hit any of the RS limitations.
I've experimented with a sequence of dependent ADD
instructions, a sequence of dependent BSWAP instructions, a sequence of dependent load instructions to the same location, a sequence of backward or forward unconditional jump instructions, and a sequence of store instructions to the same location. The following two graphs show the results for the sequence of add
instructions for different target RS occupancies (the maximum number of RS entries that will be simultaneously required and occupied by the sequence of uops). All the values are shown per iteration.
The following graph shows that RESOURCE_STALLS.RS
per iteration becomes at least (or anywhere near) 1 cycle per iteration when the RS occupancy is 50. Although it's not clearly visible, RESOURCE_STALLS.RS
becomes larger than zero when the RS occupancy exceeds 43, but only exceeds 1 when the RS occupancy exceeds 49. In other words, I'm only able to simultaneously use up to 49 RS entries out of the 60 (in Haswell) without RS stalls. After that, RESOURCE_STALLS.RS
increases on average by 1 per additional uop in the sequence, which is consistent with the bursty behavior of the allocator and the fact that each ADD
uop can be completed every cycle (each uop occupies an RS entry for 1 cycle only). cycles
increases on average by 2.3 per additional uop. It's larger than 1 per additional uop because there are also additional stalls on the ROB for reasons not related to the add
uops, but these are OK because they do not impact RESOURCE_STALLS.RS
.
The following graph shows the change in cycles
and RESOURCE_STALLS.RS
per iteration. It illustrates the strong correlation between execution time and RS stalls.
When the target RS occupancy is between 44-49, RESOURCE_STALLS.RS
is very small but still not really zero. I have also noticed that the exact order in which different uops are presented to the allocator slightly impacts the RS occupancy that can be reached. I think this is an effect of the RS array write port allocation scheme mentioned in the Intel manual.
So what's up with the other 11 RS entries (Haswell's RS is supposed to have 60 entries)? The RESOURCE_STALLS.ANY
performance event is the key to answer the question. I've updated the code I'm using to perform these experiments to test different kinds of loads:
loadspec
.loadnonspec
.loadspecreplay
.I followed the same approach with the ADD
instructions, but this time we need to watch RESOURCE_STALLS.ANY
instead of RESOURCE_STALLS.RS
(which actually doesn't capture RS stalls due to loads). The following graph shows the change in cycles
and RESOURCE_STALLS.ANY
per iteration. The first spike indicates the the target RS occupancy has exceeded the available RS entries for that kind of uop. We can clearly see that for the loadspec
case, there are exactly 11 RS entries for load uops! When the target RS occupancy exceeds 11, it takes on average 3.75 cycles for an RS entry to become free to the next load uop. This means that uops are deallocated from the RS when they complete, not when they get dispatched. This also explains how uop replay works. The spike for loadspecreplay
occurs at RS occupancy 6. The spike for loadnonspec
occurs at RS occupancy 9. As you will see later, these 11 entries are not dedicated for loads. Some of the 11 entries used by the loads may be among the 49 entries used by the ADD
uops.
I've also developed two test cases for stores: one that hits the limit of the store buffer and the other hits the limit of the RS. The graph above shows the former case. Note that a store needs two entries in the RS so the cases where the target RS occupancy is odd are the same as the previous even RS occupancies (change is zero). The graph shows that there can be up to 44/2 = 22 stores in the RS simultaneously. (The code I used to make the store graph had a bug in it that would make the achieved RS occupancy larger than what it is. After fixing it, the results show that there can be up to 20 stores in the RS simultaneously.) An entry occupied by a store-address or a store-data uop can be freed in one cycle. Intel says that the Haswell's store buffer has 42 entries, but I was not able to use all of these entries simultaneously. I'll probably have to design a different experiment to achieve that.
The jump sequences did not cause any stalls. I think this can be explained as follows: a jump uop frees the RS entry it occupies in one cycle and the allocator does not behave in a bursty way when it allocates jump uops. That is, every cycle one RS entry becomes free and the allocator will just allocate one jump uop without stalling. So we end up never stalling no matter how many jump uops there are. This is in contrast to add uops where the bursty allocator behavior makes it stall until the required number of RS entries become free (4 entries) even though the latency of an add uop is also one cycle. It makes sense that the jumps are allocated as soon as possible so that any mispredictions can be detected as early as possible. So if the allocator saw a jump and there is enough space in the RS for it but not later uops in its 4 uop group, then it would still allocate it. Otherwise, it might have to wait for potentially many cycles which can significantly delay the detection of mispredictions. This can be very costly
Is there an instruction whose uops can occupy all of the 60 entries of the RS simultaneously? Yes, one example is BSWAP
. It requires two RS entries for its two uops and I can clearly see using RESOURCE_STALLS.RS
that its uops can use all of the 60 entries of the RS simultaneously (assuming that my calculations are correct as to how the RS occupancy grows using the instruction). This proves that indeed there are exactly 60 entries in the RS. But there are constraints as to how they are used that we still don't know much about.