Indexed branch overhead on X86 64 bit mode

Posted by …衆ロ難τιáo~ on 2019-11-28 02:11:09

This was an answer to the original question, about why the loop takes 1.4x the time of the computed-jump version when the result is totally unused. IDK exactly why accumulating the result with a 1-cycle add loop-carried dependency chain would make so much difference. Interesting things to try: store the result to memory (e.g. assign it to a volatile int discard) so the asm dep chain doesn't just end with a clobbered register. HW might possibly optimize the dead-result case (e.g. discard uops once it's sure the result is dead). Intel says Sandybridge-family can do that for one of the flag-result uops in shl reg,cl.
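
As a rough sketch of the "store it to memory" experiment, something like the following could be used. The names fib_plain, fib_arg and bench_store_result are made up for illustration and are not from the original question. Only the volatile discard sink comes from the suggestion above. The argument is read through a volatile so the compiler can't hoist the call out of the loop; that detail is my assumption about what a fair test needs.

```c
#include <stdint.h>

/* Sink for results: a volatile store has to actually happen, so the
 * dependency chain ends in memory instead of a dead register. */
volatile uint64_t discard;

/* Hypothetical argument source: re-read on every iteration so the
 * compiler cannot compute fib once and reuse the value. */
volatile unsigned fib_arg = 40;

/* Plain looping Fibonacci standing in for the function under test. */
uint64_t fib_plain(unsigned n)
{
    uint64_t a = 0, b = 1;          /* fib(0), fib(1) */
    while (n--) {
        uint64_t t = a + b;
        a = b;
        b = t;
    }
    return a;
}

/* Call fib_plain reps times, storing every result to the sink.
 * Returns the last result so a caller can sanity-check it. */
uint64_t bench_store_result(unsigned long reps)
{
    uint64_t last = 0;
    for (unsigned long i = 0; i < reps; i++) {
        last = fib_plain(fib_arg);
        discard = last;
    }
    return last;
}
```

Timing bench_store_result against the same loop without the store would show whether ending the chain in memory changes anything. I haven't measured it.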


Old answer: Why the computed jump is 1.4x faster than the loop with the result unused

You're testing throughput here, not latency. In our earlier discussion, I was mostly focusing on latency. That may have been a mistake; throughput impact on the caller can often be as relevant as latency, depending on how much of what the caller does after has a data dependency on the result.

Out-of-order execution hides the latency because the result of one call isn't an input dependency for the arg to the next call. And IvyBridge's out-of-order window is large enough to be useful here: 168-entry ROB (from issue to retirement), and 54-entry scheduler (from issue to execute), and a 160-entry physical register file. See also PRF vs. ROB limits for OOO window size.
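
To make the throughput vs. latency distinction concrete, here is a hedged sketch of two benchmark loops. All names (fib_iter, sum_independent, sum_dependent) are hypothetical. In the first, each call's argument depends only on the loop counter, so separate calls can overlap in the out-of-order window. In the second, the next argument is derived from the previous result, so each call has to wait for the one before it.

```c
#include <stdint.h>

/* Plain looping Fibonacci standing in for the function under test. */
uint64_t fib_iter(unsigned n)
{
    uint64_t a = 0, b = 1;          /* fib(0), fib(1) */
    while (n--) {
        uint64_t t = a + b;
        a = b;
        b = t;
    }
    return a;
}

/* Throughput-style test: arguments don't depend on earlier results,
 * so work from several calls can be in flight at once. */
uint64_t sum_independent(unsigned n, unsigned long reps)
{
    uint64_t sum = 0;
    for (unsigned long i = 0; i < reps; i++)
        sum += fib_iter(n + (unsigned)(i & 3));
    return sum;
}

/* Latency-style test: the next argument comes from the previous
 * result, which serializes the calls through a data dependency. */
uint64_t sum_dependent(unsigned n, unsigned long reps)
{
    uint64_t sum = 0;
    unsigned arg = n;
    for (unsigned long i = 0; i < reps; i++) {
        uint64_t r = fib_iter(arg);
        sum += r;
        arg = n + (unsigned)(r & 3);   /* depends on r */
    }
    return sum;
}
```

Both loops vary the argument over n..n+3, so branch prediction for the loop exit is stressed in a roughly similar way; the main difference is whether the argument is ready early.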

OOO execution also hides the cost of the branch-mispredict before any Fib work gets done. Work from the last fib(n) dep chain is still in flight and being worked on during that mispredict. (Modern Intel CPUs only roll back to the mispredicted branch, and can keep executing uops from before the branch while the mispredict is being resolved.)

It makes sense that the computed-branch version is good here, because you're mostly bottlenecked on uop throughput, and the mispredict from the loop-exit branch costs about the same as the indirect-branch mispredict on entry to the unrolled version. IvB can macro-fuse the sub/jcc into a single uop for port 5, so the 40% number matches up pretty well. (There are 3 ALU execution units, so spending 1/3 of your ALU execution throughput on loop overhead explains it. Branch-mispredict differences and the limits of OOO execution explain the rest.)


I think in most real use-cases, latency might well be relevant. Maybe throughput will still be most important, but any use-case other than this benchmark will make latency more important, because this one doesn't even use the result at all. Of course, it's normal that there will be previous work in the pipeline that can be worked on while an indirect-branch mispredict is recovered from, but the mispredict will delay the result being ready, which might mean stalls later if most of the instructions after fib() returns are dependent on the result. But if they aren't (e.g. a lot of reloads and calculations of addresses for where to put the result), having the front-end start issuing uops from after fib() sooner is a good thing.

I think a good middle ground here would be an unroll by 4 or 8, with a check before the unrolled loop to make sure it should run at least once. (e.g. sub rcx,8 / jb .cleanup).
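
In C terms, the shape I have in mind is roughly the following. This is only a sketch: fib_unroll4 is a made-up name, the unroll factor of 4 is one arbitrary choice, and a compiler is free to restructure it, so check the generated asm. The while condition plays the role of the check before the unrolled loop, and the second loop is the cleanup.

```c
#include <stdint.h>

/* Fibonacci with the main loop unrolled by 4 steps, plus a cleanup
 * loop for the 0..3 leftover steps. */
uint64_t fib_unroll4(unsigned n)
{
    uint64_t a = 0, b = 1;          /* fib(0), fib(1) */

    /* Entered only when at least 4 steps remain, so small n goes
     * straight to the cleanup loop. */
    while (n >= 4) {
        a += b;                     /* a = fib(k+2) */
        b += a;                     /* b = fib(k+3) */
        a += b;                     /* a = fib(k+4) */
        b += a;                     /* b = fib(k+5) */
        n -= 4;
    }

    /* Cleanup: one Fibonacci step per iteration. */
    while (n--) {
        uint64_t t = a + b;
        a = b;
        b = t;
    }
    return a;
}
```

This pays loop overhead once per 4 adds instead of once per 2, without needing an indirect branch on entry.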


Also note that your looping version has a data dependency on n for the initial values. In our earlier discussion, I pointed out that avoiding this would be better for out-of-order execution, because it lets the add chain start working before n is ready. I don't think that's a big factor here, because the caller has low latency for n. But it does put the loop-branch mispredict on exiting the loop at the end of the n -> fib(n) dep chain instead of in the middle. (I'm picturing a branchless lea / cmov after the loop to do one more iteration if sub ecx, 2 went below zero instead of to zero.)
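
For illustration, here is one way to write a loop whose initial values don't depend on n. It's a variant of the idea, not the exact lea / cmov sequence described above: it starts from constants, runs n/2 double-steps, then picks between the two live values based on the low bit of n. The name fib_pairs is hypothetical, and whether the final select compiles to a cmov is up to the compiler.

```c
#include <stdint.h>

/* Fibonacci where the add chain starts from constants, so n is only
 * needed for the trip count and the final select. */
uint64_t fib_pairs(unsigned n)
{
    uint64_t a = 0, b = 1;          /* fib(0), fib(1): independent of n */

    /* After k iterations: a = fib(2k), b = fib(2k+1). */
    for (unsigned i = n / 2; i != 0; i--) {
        a += b;
        b += a;
    }

    /* Even n wants a, odd n wants b. A compiler may emit cmov here. */
    return (n & 1) ? b : a;
}
```

With this shape, the only uses of n are the loop counter and the select at the end, so the adds themselves never wait on n.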
