Question
I was benchmarking some counting-in-a-loop code, compiled with g++ at -O2, and I noticed that it has some perf problems when some condition is true in 50% of the cases. I assumed that may mean the code does unnecessary jumps (since clang produces faster code, so it is not some fundamental limitation).
What I find funny in this asm output is that the code jumps over one simple add:
```
=> 0x42b46b <benchmark_many_ints()+1659>: movslq (%rdx),%rax
   0x42b46e <benchmark_many_ints()+1662>: mov    %rax,%rcx
   0x42b471 <benchmark_many_ints()+1665>: imul   %r9,%rax
   0x42b475 <benchmark_many_ints()+1669>: shr    $0xe,%rax
   0x42b479 <benchmark_many_ints()+1673>: and    $0x1ff,%eax
   0x42b47e <benchmark_many_ints()+1678>: cmp    (%r10,%rax,4),%ecx
   0x42b482 <benchmark_many_ints()+1682>: jne    0x42b488 <benchmark_many_ints()+1688>
   0x42b484 <benchmark_many_ints()+1684>: add    $0x1,%rbx
   0x42b488 <benchmark_many_ints()+1688>: add    $0x4,%rdx
   0x42b48c <benchmark_many_ints()+1692>: cmp    %rdx,%r8
   0x42b48f <benchmark_many_ints()+1695>: jne    0x42b46b <benchmark_many_ints()+1659>
```
Note that my question is not how to fix my code; I am just asking whether there is a reason why a good compiler at -O2 would generate a `jne` instruction to jump over one cheap instruction. I ask because, from what I understand, one could "simply" get the comparison result and use it to increment the counter (`rbx` in my example) by 0 or 1, without any jumps.
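For illustration, here is a minimal sketch (a hypothetical stand-in, not my actual benchmark) of the branchless formulation I have in mind; a `bool` converts to 0 or 1, so the compiler can lower the `+=` to cmp/sete/add with no conditional jump:

```cpp
#include <cstddef>

// Add the comparison result (0 or 1) to the counter instead of
// branching over the increment.
std::size_t count_matches(const int* vals, std::size_t n, int target) {
    std::size_t counter = 0;
    for (std::size_t i = 0; i < n; ++i)
        counter += (vals[i] == target);  // bool -> 0/1; typically cmp/sete/add
    return counter;
}
```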
edit: source: https://godbolt.org/z/v0Iiv4
Answer 1:
The relevant part of the source (from a Godbolt link in a comment, which you should really edit into your question) is:

```cpp
const auto cnt = std::count_if(lookups.begin(), lookups.end(), [](const auto& val) {
    return buckets[hash_val(val) % 16] == val;
});
```
I didn't check the libstdc++ headers to see if `count_if` is implemented with an `if() { count++; }`, or if it uses a ternary to encourage branchless code. Probably a conditional. (The compiler can choose either, but a ternary is more likely to compile to a branchless `cmovcc` or `setcc`.)
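For illustration, these are the two shapes such a loop could take (sketches only, not the actual libstdc++ source):

```cpp
#include <cstddef>

// Illustrative sketches, not the real libstdc++ implementation.
template <class It, class Pred>
std::size_t count_if_branchy(It first, It last, Pred pred) {
    std::size_t n = 0;
    for (; first != last; ++first)
        if (pred(*first)) ++n;       // tends to compile to a conditional branch
    return n;
}

template <class It, class Pred>
std::size_t count_if_ternary(It first, It last, Pred pred) {
    std::size_t n = 0;
    for (; first != last; ++first)
        n += pred(*first) ? 1 : 0;   // nudges the compiler toward setcc/cmov
    return n;
}
```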
It looks like gcc overestimated the cost of branchless for this code with generic tuning. `-mtune=skylake` (implied by `-march=skylake`) gives us branchless code for this regardless of `-O2` vs. `-O3`, or `-fno-tree-vectorize` vs. `-ftree-vectorize`. (On the Godbolt compiler explorer, I also put the count in a separate function that counts a `vector<int>&`, so we don't have to wade through the timing and `cout` code-gen in `main`.)
- branchy code: gcc8.2 `-O2` or `-O3`, and `-O2`/`-O3` with `-march=haswell` or `broadwell`
- branchless code: gcc8.2 `-O2`/`-O3` with `-march=skylake`
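If you want to reproduce this without the Godbolt link, here is a self-contained sketch of the same pattern (the hash constant and the `buckets` definition are assumptions, not the original source):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

extern int buckets[16];  // assumed global lookup table, as in the source

inline std::uint64_t hash_val(int v) {
    // Placeholder multiplicative hash matching the imul/shr in the asm;
    // the real constant is in the linked source.
    return (static_cast<std::uint64_t>(v) * 0x9E3779B97F4A7C15ull) >> 14;
}

std::size_t count_hits(const std::vector<int>& lookups) {
    // Compare code-gen for: g++ -O2                 (branchy on gcc 8.2)
    //                  vs.: g++ -O2 -march=skylake  (branchless)
    return std::count_if(lookups.begin(), lookups.end(), [](const auto& val) {
        return buckets[hash_val(val) % 16] == val;
    });
}
```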
That's weird: the branchless code it emits has the same cost on Broadwell vs. Skylake. I wondered if Skylake vs. Haswell was favouring branchless because of cheaper `cmov`. GCC's internal cost model isn't always in terms of x86 instructions when it's optimizing in the middle-end (in GIMPLE, an architecture-neutral representation), so it doesn't yet know which x86 instructions would actually be used for a branchless sequence. So maybe a conditional-select operation is involved, and gcc models it as more expensive on Haswell, where `cmov` is 2 uops? But I tested `-march=broadwell` and still got branchy code. Hopefully we can rule that out, assuming gcc's cost model knows that Broadwell (not Skylake) was the first Intel P6/SnB-family uarch to have single-uop `cmov`, `adc`, and `sbb` (3-input integer ops).
I don't know what else about gcc's Skylake tuning option makes it favour branchless code for this loop. Gather is efficient on Skylake, but gcc auto-vectorizes (with `vpgatherqd xmm`) even with `-march=haswell`, where it doesn't look like a win: gather is expensive there, and it requires 32x64 => 64-bit SIMD multiplies using 2x `vpmuludq` per input vector. Maybe that's worth it with SKL, but I doubt it for HSW. It's also probably a missed optimization not to pack back down to dword elements so it could gather twice as many elements with nearly the same throughput using `vpgatherdd`.
I did rule out the function being less optimized because it was called `main` (and marked `cold`). It's generally recommended not to put your microbenchmarks in `main`: compilers at least used to optimize `main` differently (e.g. for code-size instead of just speed).
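A sketch of that advice (the `noinline` attribute is a GCC/Clang extension; names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Put the hot loop in its own function so it gets normal tuning instead of
// main's; noinline keeps it from being absorbed back into main.
__attribute__((noinline))
std::size_t benchmark_kernel(const std::vector<int>& vals, int target) {
    std::size_t n = 0;
    for (int v : vals) n += (v == target);
    return n;
}

int main() {
    std::vector<int> vals(1000000, 42);                       // illustrative input
    volatile std::size_t sink = benchmark_kernel(vals, 42);   // keep the result live
    (void)sink;
}
```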
Clang does make it branchless even with just `-O2`.
When compilers have to decide between branchy and branchless, they have heuristics that guess which will be better. If they think the condition is highly predictable (e.g. probably mostly not-taken), that leans in favour of branchy.
In this case, the heuristic could have decided that out of all 2^32 possible values for an `int`, finding exactly the value you're looking for is rare. The `==` may have fooled gcc into thinking it would be predictable.
Branchy can be better sometimes, depending on the loop, because it can break a data dependency. See gcc optimization flag -O3 makes code slower than -O2 for a case where it was very predictable, and the `-O3` branchless code-gen was slower.
`-O3` at least used to be more aggressive at if-conversion of conditionals into branchless sequences like `cmp` / `lea 1(%rbx), %rcx` / `cmove %rcx, %rbx`, or in this case more likely `xor`-zero / `cmp` / `sete` / `add`. (Actually gcc `-march=skylake` uses `sete` / `movzx`, which is pretty much strictly worse.)
Without any runtime profiling / instrumentation data, these guesses can easily be wrong. Stuff like this is where Profile-Guided Optimization shines. Compile with `-fprofile-generate`, run it, then compile with `-fprofile-use`, and you'll probably get branchless code.
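The workflow looks roughly like this (file names are placeholders; the flags are the standard GCC options named above):

```sh
g++ -O3 -fprofile-generate bench.cpp -o bench   # instrumented build
./bench                                         # run a representative workload
g++ -O3 -fprofile-use bench.cpp -o bench        # rebuild using the profile data
```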
BTW, `-O3` is generally recommended these days. See Is optimisation level -O3 dangerous in g++?. It does not enable `-funroll-loops` by default, so it only bloats code when it auto-vectorizes (especially with a very large fully-unrolled scalar prologue/epilogue around a tiny SIMD loop that bottlenecks on loop overhead. /facepalm.)
Source: https://stackoverflow.com/questions/52107358/is-there-a-good-reason-why-gcc-would-generate-jump-to-jump-just-over-one-cheap-i