Consider the following loop:

```
.loop:
    add     rsi, OFFSET        ; advance the pointer by the load stride
    mov     eax, dword [rsi]   ; load from the new address
    dec     ebp                ; decrement the iteration counter
    jg      .loop              ; loop while the counter is positive
```

where `OFFSET` is the load stride.
I think that @BeeOnRope's answer fully answers my question. I would like to add some additional details here based on @BeeOnRope's answer and the comments under it. In particular, I'll show how to determine whether a performance event occurs a fixed number of times per iteration for all load strides or not.
It's easy to see by looking at the code that it takes 3 uops to execute a single iteration (the `dec`/`jg` pair macro-fuses into a single uop). The first few loads might miss in the L1 cache, but all later loads will hit in the cache because all virtual pages are mapped to the same physical page and the L1 in Intel processors is physically tagged and indexed. So 3 uops. Now consider the `UOPS_RETIRED.ALL` performance event, which occurs when a uop retires. We expect to see about `3 * number of iterations` such events. Hardware interrupts and page faults that occur during execution require microcode assists to handle, which will probably perturb the performance events. Therefore, for a specific measurement of a performance event X, the source of each counted event can be:

- X1: executing the instructions of the loop itself,
- X2: handling page faults, or
- X3: handling hardware interrupts.

Hence, X = X1 + X2 + X3.
Since the code is simple, we were able to determine through static analysis that X1 = 3. But we don't know anything about X2 and X3, which may not be constant per iteration. We can measure X, though, using `UOPS_RETIRED.ALL`. Fortunately, for our code, the number of page faults follows a regular pattern: exactly one per page accessed (which can be verified using `perf`). It's reasonable to assume that the same amount of work is required to raise every page fault, so every fault will have the same impact on X. Note that this is in contrast to the number of page faults per iteration, which differs for different load strides. The number of uops retired as a direct result of executing the loop per page accessed is constant. Our code does not raise any software exceptions, so we don't have to worry about them. What about hardware interrupts? Well, on Linux, as long as we run the code on a core that is not assigned to handle mouse/keyboard interrupts, the only interrupt that really matters is the local APIC timer. Fortunately, this interrupt occurs regularly as well. As long as the amount of time spent per page is the same, the impact of the timer interrupt on X will be constant per page.
We can simplify the previous equation by folding the per-page-constant contributions X2 and X3 into a single term X4 = X2 + X3:

X = X1 + X4.
Thus, for all load strides,
(X per page) - (X1 per page) = (X4 per page) = constant.
Now I'll discuss why this is useful and provide examples using different performance events. We are going to need the following notation:

- `ec` = total number of performance events (measured)
- `np` = total number of virtual memory mappings used = minor page faults + major page faults (measured)
- `exp` = expected number of performance events per iteration *on average* (unknown)
- `iter` = total number of iterations (statically known)
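For this sequential access pattern, `np` can also be predicted directly from the iteration count and the stride, assuming standard 4 KB pages (a quick sketch; `pages_touched` is my name for it, not a perf quantity):

```python
PAGE_SIZE = 4096  # bytes; assumes standard 4 KB pages

def pages_touched(iterations, stride):
    """Number of distinct virtual pages touched by `iterations`
    sequential loads separated by `stride` bytes."""
    return iterations * stride // PAGE_SIZE

# 10 million iterations at stride 32 touch 78125 pages.
print(pages_touched(10_000_000, 32))  # 78125
```

This predicted value should match the minor-fault count reported by `perf`, which is a useful sanity check on a run.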
Note that in general, we don't know or aren't sure of the value of the performance event we are interested in, which is why we need to measure it. The case of retired uops was easy, but in general this is what we need to find out or verify experimentally. Essentially, `exp` is the count of performance events `ec` excluding those due to raising page faults and handling interrupts, normalized per iteration.
Based on the argument and assumptions stated above, we can derive the following equation:
C = (ec/np) - (exp*iter/np) = (ec - exp*iter)/np
There are two unknowns here: the constant C and the value we are interested in, `exp`. So we need two equations to be able to calculate them. Since this equation holds for all strides, we can use measurements from two different strides:
C = (ec1 - exp*iter)/np1
C = (ec2 - exp*iter)/np2
We can solve for `exp`:
(ec1 - exp*iter)/np1 = (ec2 - exp*iter)/np2
ec1*np2 - exp*iter*np2 = ec2*np1 - exp*iter*np1
ec1*np2 - ec2*np1 = exp*iter*np2 - exp*iter*np1
ec1*np2 - ec2*np1 = exp*iter*(np2 - np1)
Thus,
exp = (ec1*np2 - ec2*np1)/(iter*(np2 - np1))
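This closed form can be wrapped in a tiny helper and sanity-checked on synthetic counts built from a known `exp` and `C` (a sketch; the function name is mine):

```python
def solve_exp(ec1, np1, ec2, np2, iters):
    """Solve for the per-iteration event count `exp` from two runs
    with the same iteration count but different strides (and hence
    different page counts np1, np2)."""
    return (ec1 * np2 - ec2 * np1) / (iters * (np2 - np1))

# Synthetic check: fabricate counts from exp = 3, C = 275 and recover exp.
iters = 10_000_000
exp_true, C = 3, 275
np1, np2 = 78_125, 156_250
ec1 = exp_true * iters + C * np1  # total events, run 1
ec2 = exp_true * iters + C * np2  # total events, run 2
print(solve_exp(ec1, np1, ec2, np2, iters))  # 3.0
```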
Let's apply this equation to `UOPS_RETIRED.ALL`.
```
stride1 = 32
iter    = 10 million
np1     = 10 million * 32 / 4096 = 78125
ec1     = 51410801

stride2 = 64
iter    = 10 million
np2     = 10 million * 64 / 4096 = 156250
ec2     = 72883662

exp = (51410801*156250 - 72883662*78125)/(10m*(156250 - 78125))
    = 2.99
```
Nice! Very close to the expected 3 retired uops per iteration.
C = (51410801 - 2.99*10m)/78125 = 275.3
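These two computations are easy to double-check with the measured counts above:

```python
# Measured UOPS_RETIRED.ALL counts from the two runs above.
ec1, np1 = 51_410_801, 78_125    # stride 32
ec2, np2 = 72_883_662, 156_250   # stride 64
iters = 10_000_000

exp = (ec1 * np2 - ec2 * np1) / (iters * (np2 - np1))
C = (ec1 - round(exp, 2) * iters) / np1  # using exp rounded to 2.99, as above
print(round(exp, 2), round(C, 1))  # 2.99 275.3
```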
I've calculated C for all strides. It's not exactly a constant, but it's 275±1 for all strides.
`exp` for other performance events can be derived similarly:

- `MEM_LOAD_UOPS_RETIRED.L1_MISS`: `exp` = 0
- `MEM_LOAD_UOPS_RETIRED.L1_HIT`: `exp` = 1
- `MEM_UOPS_RETIRED.ALL_LOADS`: `exp` = 1
- `UOPS_RETIRED.RETIRE_SLOTS`: `exp` = 3
So does this work for all performance events? Well, let's try something less obvious. Consider, for example, `RESOURCE_STALLS.ANY`, which measures allocator stall cycles for any reason. It's rather hard to tell how large `exp` should be just by looking at the code. Note that for our code, `RESOURCE_STALLS.ROB` and `RESOURCE_STALLS.RS` are zero; only `RESOURCE_STALLS.ANY` is significant here. Armed with the equation for `exp` and experimental results for different strides, we can calculate `exp`.
```
stride1 = 32
iter    = 10 million
np1     = 10 million * 32 / 4096 = 78125
ec1     = 9207261

stride2 = 64
iter    = 10 million
np2     = 10 million * 64 / 4096 = 156250
ec2     = 16111308

exp = (9207261*156250 - 16111308*78125)/(10m*(156250 - 78125))
    = 0.23

C = (9207261 - 0.23*10m)/78125 = 88.4
```
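The same quick check works for these counts:

```python
# Measured RESOURCE_STALLS.ANY counts from the two runs above.
ec1, np1 = 9_207_261, 78_125     # stride 32
ec2, np2 = 16_111_308, 156_250   # stride 64
iters = 10_000_000

exp = (ec1 * np2 - ec2 * np1) / (iters * (np2 - np1))
C = (ec1 - round(exp, 2) * iters) / np1  # using exp rounded to 0.23, as above
print(round(exp, 2), round(C, 1))  # 0.23 88.4
```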
I've calculated C for all strides. Well, it doesn't look constant. Perhaps we should use different strides? No harm in trying.
```
stride1 = 32
iter1   = 10 million
np1     = 10 million * 32 / 4096 = 78125
ec1     = 9207261

stride2 = 4096
iter2   = 1 million
np2     = 1 million * 4096 / 4096 = 1m
ec2     = 102563371

exp = (ec1*np2 - ec2*np1)/(iter1*np2 - iter2*np1)
    = (9207261*1m - 102563371*78125)/(10m*1m - 1m*78125)
    = 0.12

C = (9207261 - 0.12*10m)/78125 = 102.5
```
(Note that this time I used a different number of iterations for each stride, just to show that you can do that; the formula for `exp` generalizes accordingly, with `iter*(np2 - np1)` becoming `iter1*np2 - iter2*np1`.)
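The generalized solve, where the two runs use different iteration counts, follows from C = (ec1 - exp*iter1)/np1 = (ec2 - exp*iter2)/np2. A sketch, checked on synthetic numbers rather than the measurements above (the function name is mine):

```python
def solve_exp_general(ec1, np1, it1, ec2, np2, it2):
    """Solve for `exp` when the two runs use different iteration counts:
    cross-multiplying the two C equations gives
    ec1*np2 - ec2*np1 = exp*(it1*np2 - it2*np1)."""
    return (ec1 * np2 - ec2 * np1) / (it1 * np2 - it2 * np1)

# Synthetic check: fabricate counts from exp = 2, C = 100 and recover exp.
exp_true, C = 2, 100
it1, np1 = 10_000_000, 78_125
it2, np2 = 1_000_000, 1_000_000
ec1 = exp_true * it1 + C * np1
ec2 = exp_true * it2 + C * np2
print(solve_exp_general(ec1, np1, it1, ec2, np2, it2))  # 2.0
```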
We got a different value for `exp`. I've calculated C for all strides, and it still does not look constant, as the following graph shows. It varies significantly for smaller strides, and slightly after 2048. This means that the assumption that there is a fixed number of allocator stall cycles per page does not fully hold. In other words, the standard deviation of the allocator stall cycles across different strides is significant.
For the `UOPS_RETIRED.STALL_CYCLES` performance event, `exp` = -0.32 and the standard deviation is also significant. This means that the assumption that there is a fixed number of retired stall cycles per page does not fully hold either.
I've developed an easy way to correct the measured number of retired instructions. Each triggered page fault adds exactly one extra event to the retired instructions counter. For example, assume that a page fault occurs regularly after some fixed number of iterations, say 2; that is, a fault is triggered every two iterations. This happens for the code in the question when the stride is 2048. Since we expect 4 instructions to retire per iteration, the total number of expected retired instructions until a page fault occurs is 4*2 = 8. Since a page fault adds one extra event to the retired instructions counter, it will be measured as 9 for the two iterations instead of 8, i.e., 4.5 per iteration. When I actually measure the retired instructions count for the 2048 stride case, it is very close to 4.5. In all cases, when I apply this method to statically predict the measured retired instructions per iteration, the error is always less than 1%. This is extremely accurate despite hardware interrupts. I think that as long as the total execution time is less than 5 billion core cycles, hardware interrupts will not have any significant impact on the retired instructions counter. (Each one of my experiments took no more than 5 billion cycles, which is why.) But as explained above, one must always pay attention to the number of page faults that occurred.
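This correction amounts to adding one event per fault, amortized over the iterations between consecutive faults. A minimal sketch (the function name is mine):

```python
def predicted_measured_per_iter(exp_per_iter, iters_per_fault):
    """Predicted *measured* retired instructions per iteration: the true
    per-iteration count plus the one extra event each page fault adds,
    amortized over the iterations between consecutive faults."""
    return exp_per_iter + 1 / iters_per_fault

# Stride 2048: one page fault every 4096/2048 = 2 iterations,
# 4 instructions retired per iteration -> 4.5 measured per iteration.
print(predicted_measured_per_iter(4, 2))  # 4.5
```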
As I have discussed above, many performance counters can be corrected by calculating the per-page values. On the other hand, the retired instructions counter can be corrected by considering the number of iterations it takes to trigger a page fault. `RESOURCE_STALLS.ANY` and `UOPS_RETIRED.STALL_CYCLES` can perhaps be corrected similarly to the retired instructions counter, but I have not investigated those two.