Question
I was benchmarking some counting-in-a-loop code, compiled with g++ at -O2, and I noticed that it has some perf problems when some condition is true in 50% of the cases. I assumed that may mean the code does unnecessary jumps (since clang produces faster code, so it is not some fundamental limitation).
What I find funny in this asm output is that the code jumps over one simple add:
```
=> 0x42b46b <benchmark_many_ints()+1659>: movslq (%rdx),%rax
   0x42b46e <benchmark_many_ints()+1662>: mov    %rax,%rcx
   0x42b471 <benchmark_many_ints()+1665>: imul   %r9,%rax
   0x42b475 <benchmark_many_ints()+1669>: shr    $0xe,%rax
   0x42b479 <benchmark_many_ints()+1673>: and    $0x1ff,%eax
   0x42b47e <benchmark_many_ints()+1678>: cmp    (%r10,%rax,4),%ecx
   0x42b482 <benchmark_many_ints()+1682>: jne    0x42b488 <benchmark_many_ints()+1688>
   0x42b484 <benchmark_many_ints()+1684>: add    $0x1,%rbx
   0x42b488 <benchmark_many_ints()+1688>: add    $0x4,%rdx
   0x42b48c <benchmark_many_ints()+1692>: cmp    %rdx,%r8
   0x42b48f <benchmark_many_ints()+1695>: jne    0x42b46b <benchmark_many_ints()+1659>
```
Note that my question is not how to fix my code; I am just asking whether there is a reason why a good compiler at -O2 would generate a `jne` instruction to jump over one cheap instruction. I ask because, from what I understand, one could "simply" get the comparison result and use it to increment the counter (`rbx` in my example) by 0 or 1, without any jumps.
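For illustration, here is a minimal sketch (a hypothetical stand-in, not my actual benchmark) of the branchless formulation I have in mind; a `bool` converts to 0 or 1, so the compiler can lower the `+=` to cmp/sete/add with no conditional jump:

```cpp
#include <cstddef>

// Add the comparison result (0 or 1) to the counter instead of
// branching over the increment.
std::size_t count_matches(const int* vals, std::size_t n, int target) {
    std::size_t counter = 0;
    for (std::size_t i = 0; i < n; ++i)
        counter += (vals[i] == target);  // bool -> 0/1; typically cmp/sete/add
    return counter;
}
```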
edit: source: https://godbolt.org/z/v0Iiv4
Answer 1:
The relevant part of the source (from a Godbolt link in a comment, which you should really edit into your question) is:

```cpp
const auto cnt = std::count_if(lookups.begin(), lookups.end(), [](const auto& val) {
    return buckets[hash_val(val) % 16] == val;
});
```
I didn't check the libstdc++ headers to see if `count_if` is implemented with an `if() { count++; }`, or if it uses a ternary to encourage branchless code. Probably a conditional. (The compiler can choose either, but a ternary is more likely to compile to a branchless `cmovcc` or `setcc`.)
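For illustration, these are the two shapes such a loop could take (sketches only, not the actual libstdc++ source):

```cpp
#include <cstddef>

// Illustrative sketches, not the real libstdc++ implementation.
template <class It, class Pred>
std::size_t count_if_branchy(It first, It last, Pred pred) {
    std::size_t n = 0;
    for (; first != last; ++first)
        if (pred(*first)) ++n;       // tends to compile to a conditional branch
    return n;
}

template <class It, class Pred>
std::size_t count_if_ternary(It first, It last, Pred pred) {
    std::size_t n = 0;
    for (; first != last; ++first)
        n += pred(*first) ? 1 : 0;   // nudges the compiler toward setcc/cmov
    return n;
}
```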
It looks like gcc overestimated the cost of branchless for this code with generic tuning. `-mtune=skylake` (implied by `-march=skylake`) gives us branchless code for this regardless of `-O2` vs. `-O3`, or `-fno-tree-vectorize` vs. `-ftree-vectorize`. (On the Godbolt compiler explorer, I also put the count in a separate function that counts a `vector<int>&`, so we don't have to wade through the timing and `cout` code-gen in `main`.)
- branchy code: gcc8.2 `-O2` or `-O3`, and `-O2`/`-O3` with `-march=haswell` or `broadwell`
- branchless code: gcc8.2 `-O2`/`-O3` with `-march=skylake`
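If you want to reproduce this without the Godbolt link, here is a self-contained sketch of the same pattern (the hash constant and the `buckets` definition are assumptions, not the original source):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

extern int buckets[16];  // assumed global lookup table, as in the source

inline std::uint64_t hash_val(int v) {
    // Placeholder multiplicative hash matching the imul/shr in the asm;
    // the real constant is in the linked source.
    return (static_cast<std::uint64_t>(v) * 0x9E3779B97F4A7C15ull) >> 14;
}

std::size_t count_hits(const std::vector<int>& lookups) {
    // Compare code-gen for: g++ -O2                 (branchy on gcc 8.2)
    //                  vs.: g++ -O2 -march=skylake  (branchless)
    return std::count_if(lookups.begin(), lookups.end(), [](const auto& val) {
        return buckets[hash_val(val) % 16] == val;
    });
}
```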
That's weird: the branchless code it emits has the same cost on Broadwell vs. Skylake. I wondered if Skylake vs. Haswell was favouring branchless because of cheaper `cmov`. GCC's internal cost model isn't always in terms of x86 instructions when it's optimizing in the middle-end (in GIMPLE, an architecture-neutral representation), so it doesn't yet know which x86 instructions would actually be used for a branchless sequence. So maybe a conditional-select operation is involved, and gcc models it as more expensive on Haswell, where `cmov` is 2 uops? But I tested `-march=broadwell` and still got branchy code. Hopefully we can rule that out, assuming gcc's cost model knows that Broadwell (not Skylake) was the first Intel P6/SnB-family uarch to have single-uop `cmov`, `adc`, and `sbb` (3-input integer ops).
I don't know what else about gcc's Skylake tuning option makes it favour branchless code for this loop. Gather is efficient on Skylake, but gcc auto-vectorizes (with `vpgatherqd xmm`) even with `-march=haswell`, where it doesn't look like a win: gather is expensive there, and it requires 32x64 => 64-bit SIMD multiplies using 2x `vpmuludq` per input vector. Maybe that's worth it with SKL, but I doubt it for HSW. It's also probably a missed optimization not to pack back down to dword elements so it could gather twice as many elements with nearly the same throughput using `vpgatherdd`.
I did rule out the function being less optimized because it was called `main` (and marked `cold`). It's generally recommended not to put your microbenchmarks in `main`: compilers at least used to optimize `main` differently (e.g. for code-size instead of just speed).
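A sketch of that advice (the `noinline` attribute is a GCC/Clang extension; names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Put the hot loop in its own function so it gets normal tuning instead of
// main's; noinline keeps it from being absorbed back into main.
__attribute__((noinline))
std::size_t benchmark_kernel(const std::vector<int>& vals, int target) {
    std::size_t n = 0;
    for (int v : vals) n += (v == target);
    return n;
}

int main() {
    std::vector<int> vals(1000000, 42);                       // illustrative input
    volatile std::size_t sink = benchmark_kernel(vals, 42);   // keep the result live
    (void)sink;
}
```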
Clang does make it branchless even with just `-O2`.
When compilers have to decide between branchy and branchless, they have heuristics that guess which will be better. If they think the condition is highly predictable (e.g. probably mostly not-taken), that leans in favour of branchy.
In this case, the heuristic could have decided that out of all 2^32 possible values for an `int`, finding exactly the value you're looking for is rare. The `==` may have fooled gcc into thinking it would be predictable.
Branchy can be better sometimes, depending on the loop, because it can break a data dependency. See gcc optimization flag -O3 makes code slower than -O2 for a case where it was very predictable, and the `-O3` branchless code-gen was slower.
`-O3` at least used to be more aggressive at if-conversion of conditionals into branchless sequences like `cmp` / `lea 1(%rbx), %rcx` / `cmove %rcx, %rbx`, or in this case more likely `xor`-zero / `cmp` / `sete` / `add`. (Actually gcc `-march=skylake` uses `sete` / `movzx`, which is pretty much strictly worse.)
Without any runtime profiling / instrumentation data, these guesses can easily be wrong. Stuff like this is where Profile-Guided Optimization shines. Compile with `-fprofile-generate`, run it, then compile with `-fprofile-use`, and you'll probably get branchless code.
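The workflow looks roughly like this (file names are placeholders; the flags are the standard GCC options named above):

```sh
g++ -O3 -fprofile-generate bench.cpp -o bench   # instrumented build
./bench                                         # run a representative workload
g++ -O3 -fprofile-use bench.cpp -o bench        # rebuild using the profile data
```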
BTW, `-O3` is generally recommended these days. See Is optimisation level -O3 dangerous in g++?. It does not enable `-funroll-loops` by default, so it only bloats code when it auto-vectorizes (especially with a very large fully-unrolled scalar prologue/epilogue around a tiny SIMD loop that bottlenecks on loop overhead. /facepalm.)
Source: https://stackoverflow.com/questions/52107358/is-there-a-good-reason-why-gcc-would-generate-jump-to-jump-just-over-one-cheap-i