Question
This is a follow-up to some comments made in this prior thread: Recursive fibonacci Assembly
The following code snippets calculate Fibonacci, the first example with a loop, the second with a computed jump (indexed branch) into an unfolded loop. This was tested using Visual Studio 2015 Desktop Express on Windows 7 Pro, 64-bit mode, with an Intel i7-3770K 3.5 GHz processor. With a single loop testing fib(0) through fib(93), the best time I get for the loop version is ~1.901 microseconds, and for the computed jump ~1.324 microseconds. Using an outer loop to repeat this process 1,048,576 times, the loop version takes about 1.44 seconds and the computed jump about 1.04 seconds. In both sets of tests, the loop version is about 40% slower than the computed jump version.
Question: Why is the loop version much more sensitive to code location than the computed jump version? In prior tests, some code location combinations caused the loop version time to increase from about 1.44 seconds to 1.93 seconds, but I never found a combination that significantly affected the computed jump version time.
Partial answer: The computed jump version branches into 94 possible target locations within a 280-byte range, and apparently a branch target buffer (cache) does a good job of optimizing this. For the loop version, using align 16 to put the assembly-based fib() function on a 16-byte boundary solved the loop version time issue for most cases, but some changes to main() were still affecting the time. I need to find a reasonably small and repeatable test case.
loop version (note: I've read that dec / jnz is faster than loop):
align 16
fib proc ;rcx == n
mov rax,rcx ;br if < 2
cmp rax,2
jb fib1
mov rdx,1 ;set rax, rdx
and rax,rdx
sub rdx,rax
shr rcx,1
fib0: add rdx,rax
add rax,rdx
dec rcx
jnz fib0
fib1: ret
fib endp
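For readers following along in C, here's a minimal model of what the loop version computes (my sketch, not part of the original post): each iteration's two adds advance the (fib(k-1), fib(k)) pair by two steps, so only n/2 iterations are needed, with the starting pair picked by the parity of n.

#include <stdint.h>

/* sketch: C model of the asm loop above (fib_c is my name, not the
   code under test). Each two-add iteration advances the Fibonacci
   pair two steps, so the loop runs n >> 1 times; n's low bit picks
   the starting pair: (0,1) for even n, (1,0) for odd n. */
uint64_t fib_c(uint64_t n)
{
    if (n < 2)                 /* fib(0) = 0, fib(1) = 1 */
        return n;
    uint64_t a = n & 1;        /* rax */
    uint64_t d = 1 - (n & 1);  /* rdx */
    for (uint64_t i = n >> 1; i != 0; i--) {
        d += a;                /* add rdx,rax */
        a += d;                /* add rax,rdx */
    }
    return a;
}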
computed jump (indexed branch) into unfolded loop version:
align 16
fib proc ;rcx == n
mov r8,rcx ;set jmp adr
mov r9,offset fib0+279
lea r8,[r8+r8*2]
neg r8
add r8,r9
mov rax,rcx ;set rax,rdx
mov rdx,1
and rax,rdx
sub rdx,rax
jmp r8
fib0: ; assumes add xxx,xxx takes 3 bytes
rept 46
add rax,rdx
add rdx,rax
endm
add rax,rdx
ret
fib endp
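For reference, here's how the entry-point arithmetic works (a C re-derivation I wrote, not the measured code): there are 93 three-byte add instructions before the ret, so jumping to offset fib0 + 279 - 3*n leaves exactly n adds to execute, and those n adds carry the parity-selected starting pair to fib(n).

#include <stdint.h>

/* sketch: model of the computed-jump version. The asm jumps to
   fib0 + 279 - 3*n, so exactly n of the 93 alternating 3-byte adds
   run before the ret; odd-numbered slots in the unrolled block are
   add rax,rdx, even-numbered slots are add rdx,rax. */
uint64_t fib_jump_model(uint64_t n)
{
    uint64_t a = n & 1;          /* rax */
    uint64_t d = 1 - (n & 1);    /* rdx */
    for (uint64_t i = 0; i < n; i++) {
        if ((n + i) & 1)         /* slot 94-n+i is odd: add rax,rdx */
            a += d;
        else                     /* even slot: add rdx,rax */
            d += a;
    }
    return a;                    /* n = 0 jumps straight to ret */
}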
Test code that runs 1,048,576 (0x100000, about 1 million) loops calculating fib(0) to fib(93), using multiples of 37 mod 93 so the order is not sequential. On my system, the loop version took about 1.44 seconds and the indexed branch version took about 1.04 seconds.
#include <stdio.h>
#include <time.h>
typedef unsigned int uint32_t;       /* MSVC sizes; <stdint.h> would also work */
typedef unsigned long long uint64_t;
extern "C" uint64_t fib(uint64_t);
/* multiples of 37 mod 93 + 93 at end */
static uint64_t a[94] =
{0,37,74,18,55,92,36,73,17,54,
91,35,72,16,53,90,34,71,15,52,
89,33,70,14,51,88,32,69,13,50,
87,31,68,12,49,86,30,67,11,48,
85,29,66,10,47,84,28,65, 9,46,
83,27,64, 8,45,82,26,63, 7,44,
81,25,62, 6,43,80,24,61, 5,42,
79,23,60, 4,41,78,22,59, 3,40,
77,21,58, 2,39,76,20,57, 1,38,
75,19,56,93};
/* x used to avoid compiler optimizing out result of fib() */
int main()
{
    size_t i, j;
    clock_t cbeg, cend;
    uint64_t x = 0;
    cbeg = clock();
    for(j = 0; j < 0x100000; j++)
        for(i = 0; i < 94; i++)
            x += fib(a[i]);
    cend = clock();
    printf("%llx\n", x);
    printf("# ticks = %u\n", (uint32_t)(cend-cbeg));
    return 0;
}
The output for x is 0x812a62b1dc000000. The sum of fib(0) to fib(93) in hex is 0x1bb433812a62b1dc0; appending 5 more hex zeros for looping 0x100000 times gives 0x1bb433812a62b1dc000000. The upper 6 nibbles are truncated due to 64-bit math.
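That checksum can be reproduced with the fib_c model from earlier (a sketch; note the 94-term sum itself already wraps mod 2^64, but the low bits survive the shift):

#include <stdio.h>
#include <stdint.h>

uint64_t fib_c(uint64_t n);   /* the C model sketched earlier */

/* sketch: reproduce the expected checksum. sum(fib(0..93)) is
   0x1bb433812a62b1dc0, which already overflows 64 bits, but
   (sum << 20), i.e. sum * 0x100000, still comes out to
   0x812a62b1dc000000 in 64-bit math. */
int main(void)
{
    uint64_t sum = 0;
    for (uint64_t n = 0; n <= 93; n++)
        sum += fib_c(n);                      /* wraps mod 2^64 */
    uint64_t x = sum << 20;                   /* 0x100000 outer loops */
    printf("%llx\n", (unsigned long long)x);  /* 812a62b1dc000000 */
    return x == 0x812a62b1dc000000ull ? 0 : 1;
}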
I made an all-assembly version to better control code location. The "if 1" is changed to "if 0" for the loop version. The loop version takes about 1.465 to 2.000 seconds depending on the nop padding used to put key locations on even or odd 16-byte boundaries (see comments below). The computed jump version takes about 1.04 seconds, and boundaries make less than 1% difference in timing.
includelib msvcrtd
includelib oldnames
.data
; multiples of 37 mod 93 + 93 at the end
a dq 0,37,74,18,55,92,36,73,17,54
dq 91,35,72,16,53,90,34,71,15,52
dq 89,33,70,14,51,88,32,69,13,50
dq 87,31,68,12,49,86,30,67,11,48
dq 85,29,66,10,47,84,28,65, 9,46
dq 83,27,64, 8,45,82,26,63, 7,44
dq 81,25,62, 6,43,80,24,61, 5,42
dq 79,23,60, 4,41,78,22,59, 3,40
dq 77,21,58, 2,39,76,20,57, 1,38
dq 75,19,56,93
.data?
.code
; parameters rcx,rdx,r8,r9
; not saved rax,rcx,rdx,r8,r9,r10,r11
; code starts on 16 byte boundary
main proc
push r15
push r14
push r13
push r12
push rbp
mov rbp,rsp
and rsp,0fffffffffffffff0h
sub rsp,64
mov r15,offset a
xor r14,r14
mov r11,0100000h
; nop padding effect on loop version (with 0 padding in padx below)
; 0 puts main2 on odd 16 byte boundary clk = 0131876622h => 1.465 seconds
; 9 puts main1 on odd 16 byte boundary clk = 01573FE951h => 1.645 seconds
rept 0
nop
endm
rdtsc
mov r12,rdx
shl r12,32
or r12,rax
main0: xor r10,r10
main1: mov rcx,[r10+r15]
call fib
main2: add r14,rax
add r10,8
cmp r10,8*94
jne main1
dec r11
jnz main0
rdtsc
mov r13,rdx
shl r13,32
or r13,rax
sub r13,r12
        mov     rdx,r14                 ;checksum in rdx
        xor     rax,rax                 ;return 0
mov rsp,rbp
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
main endp
align 16
padx proc
; nop padding effect on loop version with 0 padding above
; 0 puts fib on odd 16 byte boundary clk = 0131876622h => 1.465 seconds
; 16 puts fib on even 16 byte boundary clk = 01A13C8CB8h => 2.000 seconds
; nop padding effect on computed jump version with 9 padding above
; 0 puts fib on odd 16 byte boundary clk = 00D979792Dh => 1.042 seconds
; 16 puts fib on even 16 byte boundary clk = 00DA93E04Dh => 1.048 seconds
rept 0
nop
endm
padx endp
if 1 ;0 = loop version, 1 = computed jump version
fib proc ;rcx == n
mov r8,rcx ;set jmp adr
mov r9,offset fib0+279
lea r8,[r8+r8*2]
neg r8
add r8,r9
mov rax,rcx ;set rax,rdx
mov rdx,1
and rax,rdx
sub rdx,rax
jmp r8
fib0: ; assumes add xxx,xxx takes 3 bytes
rept 46
add rax,rdx
add rdx,rax
endm
add rax,rdx
ret
fib endp
else
fib proc ;rcx == n
mov rax,rcx ;br if < 2
cmp rax,2
jb fib1
mov rdx,1 ;set rax, rdx
and rax,rdx
sub rdx,rax
shr rcx,1
fib0: add rdx,rax
add rax,rdx
dec rcx
jnz fib0
fib1: ret
fib endp
endif
end
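As an aside, the rdtsc / shl 32 / or sequence around the timed region builds a 64-bit timestamp by hand; in C, MSVC's __rdtsc() intrinsic returns the same value directly (a minimal sketch, with fib() being the asm routine linked in as in the C test above):

#include <stdio.h>
#include <intrin.h>    /* MSVC intrinsics: __rdtsc() */

unsigned long long fib(unsigned long long);  /* the asm fib, linked in */

/* sketch: time a single fib(0)..fib(93) pass in reference cycles */
int main(void)
{
    unsigned long long x = 0, n;
    unsigned long long beg = __rdtsc();      /* same as rdtsc/shl/or in asm */
    for (n = 0; n <= 93; n++)
        x += fib(n);
    unsigned long long end = __rdtsc();
    printf("%llx, reference cycles = %llu\n", x, end - beg);
    return 0;
}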
Answer 1:
This was an answer to the original question, about why the loop takes 1.4x the time of the computed-jump version when the result is totally unused. IDK exactly why accumulating the result with a 1-cycle add loop-carried dependency chain would make so much difference. Interesting things to try: store it to memory (e.g. assign it to a volatile int discard) so the asm dep chain doesn't just end with a clobbered register. HW might possibly optimize that (e.g. discard uops once it's sure the result is dead). Intel says Sandybridge-family can do that for one of the flag-result uops in shl reg,cl.
Old answer: Why the computed jump is 1.4x faster than the loop with the result unused
You're testing throughput here, not latency. In our earlier discussion, I was mostly focusing on latency. That may have been a mistake; throughput impact on the caller can often be as relevant as latency, depending on how much of what the caller does next has a data dependency on the result.
Out-of-order execution hides the latency because the result of one call isn't an input dependency for the arg to the next call. And IvyBridge's out-of-order window is large enough to be useful here: 168-entry ROB (from issue to retirement), and 54-entry scheduler (from issue to execute), and a 160-entry physical register file. See also PRF vs. ROB limits for OOO window size.
OOO execution also hides the cost of the branch mispredict before any Fib work gets done. Work from the last fib(n) dep chain is still in flight and being worked on during that mispredict. (Modern Intel CPUs only roll back to the mispredicted branch, and can keep executing uops from before the branch while the mispredict is being resolved.)
It makes sense that the computed-branch version is good here, because you're mostly bottlenecked on uop throughput, and the mispredict from the loop-exit branch costs about the same as the indirect-branch mispredict on entry to the unrolled version. IvB can macro-fuse the sub/jcc into a single uop for port 5, so the 40% number matches up pretty well. (3 ALU execution units, so spending 1/3 of your ALU execution throughput on loop overhead explains it. Branch-mispredict differences and the limits of OOO execution explain the rest.)
I think in most real use-cases, latency will be relevant. Maybe throughput will still be most important, but anything other than this will make latency more important, because this doesn't even use the result at all. Of course, it's normal that there will be previous work in the pipeline that can be worked on while an indirect-branch mispredict is recovered from, but this will delay the result being ready, which might mean stalls later if most of the instructions after fib() returns are dependent on the result. But if they aren't (e.g. a lot of reloads and calculations of addresses for where to put the result), having the front-end start issuing uops from after fib() sooner is a good thing.
I think a good middle ground here would be an unroll by 4 or 8, with a check before the unrolled loop to make sure it should run at least once (e.g. sub rcx,8 / jb .cleanup).
Also note that your looping version has a data dependency on n for the initial values. In our earlier discussion, I pointed out that avoiding this would be better for out-of-order execution, because it lets the add chain start working before n is ready. I don't think that's a big factor here, because the caller has low latency for n. But it does put the loop-branch mispredict on exiting the loop at the end of the n -> fib(n) dep chain instead of in the middle. (I'm picturing a branchless lea / cmov after the loop to do one more iteration if sub ecx, 2 went below zero instead of to zero.)
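A rough C model of that idea (my construction, not code from the answer; the final select would be a cmov in asm rather than a branch):

#include <stdint.h>

/* sketch: make the loop's startup values independent of n's parity.
   Starting from (0,1) and running k = n>>1 pair-steps leaves
   a = fib(2k), d = fib(2k-1); odd n needs one more single step. */
uint64_t fib_fixed_start(uint64_t n)
{
    if (n < 2)
        return n;
    uint64_t a = 0, d = 1;        /* start doesn't depend on n's parity */
    for (uint64_t k = n >> 1; k != 0; k--) {
        d += a;
        a += d;
    }
    uint64_t odd = a + d;         /* one extra step, always computed */
    return (n & 1) ? odd : a;     /* likely compiles to cmov, not a branch */
}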
Source: https://stackoverflow.com/questions/47052536/indexed-branch-overhead-on-x86-64-bit-mode