Is < faster than <=?


孤城傲影 2020-11-22 13:43

Is if( a < 901 ) faster than if( a <= 900 )?

Not exactly as in this simple example, but there are slight performance changes on loop

14 answers

情话喂你 (original poster)

2020-11-22 14:16
When I wrote the first version of this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operands have a shorter 1-byte encoding for -128..127.
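A minimal sketch of that rewrite, with function names made up purely for illustration: for integers, a < C and a <= C-1 are the same predicate as long as C-1 doesn't overflow, so the compiler is free to pick whichever constant encodes better. fits_imm8 only checks the sign-extended 8-bit range mentioned above; it is not a model of any real compiler pass.

```cpp
// Hypothetical helpers, not taken from any compiler.
constexpr bool lt_901(int a) { return a < 901; }
constexpr bool le_900(int a) { return a <= 900; }   // same predicate as lt_901

// x86 sign-extended 8-bit immediate range.
constexpr bool fits_imm8(int c) { return c >= -128 && c <= 127; }

// At the imm8 boundary the rewrite matters: 128 needs a wider immediate,
// 127 fits in one byte, and both functions accept exactly the same inputs.
constexpr bool lt_128(int x) { return x < 128; }
constexpr bool le_127(int x) { return x <= 127; }
```

For 901 vs. 900 neither constant fits in 8 bits, so on x86 there is nothing to gain in that particular example.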

For ARM, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp r0, #0x00f000 would be encodable, while cmp r0, #0x00efff would not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for ARM. AArch64 is either shift-by-12 or not, instead of an arbitrary rotation, for instructions like cmp and cmn, unlike 32-bit ARM and Thumb modes.
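Here is a rough sketch of the 32-bit ARM (A32) rule, assuming the classic encoding of an 8-bit value rotated right by an even amount. The helper names are mine, and Thumb-2 and AArch64 have different rules that this does not cover.

```cpp
#include <cstdint>

// Rotate a 32-bit value left by r bits (r in 0..31).
constexpr std::uint32_t rotl32(std::uint32_t v, unsigned r) {
    return r == 0 ? v : (v << r) | (v >> (32 - r));
}

// True if v is an 8-bit value rotated right by an even amount (0, 2, ..., 30).
// Assumption: this models only the A32 data-processing immediate form.
constexpr bool is_a32_immediate(std::uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        // Undo a rotate-right by rot with a rotate-left by rot.
        if (rotl32(v, rot) <= 0xFFu) return true;
    }
    return false;
}
```

With this check, is_a32_immediate(0x00f000) is true and is_a32_immediate(0x00efff) is false, matching the two cmp examples above.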

< vs. <= in general, including for runtime-variable conditions

In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.

But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.

As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.

Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)
```
void foo(unsigned size) {
    unsigned upper_bound = size - 1;  // or any calculation that could produce UINT_MAX
    for (unsigned i = 0; i <= upper_bound; i++) {
        // ... loop body
    }
}
```
Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.

(A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)

In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.
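To make the size=0 case concrete without actually hanging, here is a sketch that shrinks the counter to 8 bits, so UINT8_MAX plays the role of UINT_MAX, and adds an artificial iteration cap. The names upper_bound_for, trips_le and cap are invented for this illustration.

```cpp
#include <cstdint>

// size - 1 in 8-bit unsigned arithmetic: size == 0 wraps to 255.
constexpr std::uint8_t upper_bound_for(std::uint8_t size) {
    return static_cast<std::uint8_t>(size - 1);
}

// Counts iterations of `for (i = 0; i <= upper_bound; ++i)` with an 8-bit
// counter, bailing out after `cap` iterations so an endless loop is observable.
constexpr unsigned trips_le(std::uint8_t upper_bound, unsigned cap) {
    unsigned trips = 0;
    for (std::uint8_t i = 0; i <= upper_bound; ++i) {
        if (++trips == cap) break;  // safety net: the real loop has no such exit
    }
    return trips;
}
```

trips_le(upper_bound_for(10), 1000) stops after 10 iterations, while trips_le(upper_bound_for(0), 1000) only stops because of the cap: i wraps from 255 back to 0 and the condition never becomes false.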

Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).
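Written back in C++, that transformation looks roughly like the sketch below. The function names are hypothetical and the loop body is reduced to a counter, so this only shows that the two shapes run the same number of iterations.

```cpp
// Source-level shape: count up and compare against size every iteration.
constexpr unsigned count_for(unsigned size) {
    unsigned trips = 0;
    for (unsigned i = 0; i < size; ++i) {
        ++trips;  // stand-in for a loop body that doesn't use i
    }
    return trips;
}

// Shape the compiler prefers: one up-front zero check, then count down to zero.
constexpr unsigned count_do_while(unsigned size) {
    unsigned trips = 0;
    if (size != 0) {
        do {
            ++trips;
        } while (--size);  // decrement and test in one step
    }
    return trips;
}
```

The up-front check is what keeps size==0 from entering the do{}while at all.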
But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (Iterating over all unsigned integers in a for loop: C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)

With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.

Example: sum of integers from 1 to n

Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.

```
unsigned sum_1_to_n_finite(unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0 ; i < n+1 ; ++i)
        total += i;
    return total;
}
```

x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer

```
 # clang7.0 -O3 closed-form
    cmp     edi, -1           # n passed in EDI: x86-64 System V calling convention
    je      .LBB1_1           # if (n == UINT_MAX) return 0;  // C++ loop runs 0 times
                              # else fall through into the closed-form calc
    mov     ecx, edi          # zero-extend n into RCX
    lea     eax, [rdi - 1]    # n-1
    imul    rax, rcx          # n * (n-1)             # 64-bit
    shr     rax               # n * (n-1) / 2
    add     eax, edi          # n + (stuff / 2) = n * (n+1) / 2   # truncated to 32-bit
    ret                       # computed without possible overflow of the product before right shifting
.LBB1_1:
    xor     eax, eax
    ret
```

But for the naive version, we just get a dumb loop from clang.

```
unsigned sum_1_to_n_naive(unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0 ; i<=n ; ++i)
        total += i;
    return total;
}
```

```
# clang7.0 -O3
sum_1_to_n(unsigned int):
    xor     ecx, ecx          # i = 0
    xor     eax, eax          # retval = 0
.LBB0_1:                      # do {
    add     eax, ecx          #   retval += i
    add     ecx, 1            #   ++i
    cmp     ecx, edi
    jbe     .LBB0_1           # } while( i <= n );
    ret
```

GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.

```
# "naive" inner loop
.L3:
    add     eax, 1            # do {
    paddd   xmm0, xmm1        # vect_total_4.6, vect_vec_iv_.5
    paddd   xmm1, xmm2        # vect_vec_iv_.5, tmp114
    cmp     edx, eax          # bnd.1, ivtmp.14     # bound and induction-variable tmp, I think.
    ja      .L3 #,            # }while( n > i )

# "finite" inner loop
  # before the loop:
  # xmm0 = 0 = totals
  # xmm1 = {0,1,2,3} = i
  # xmm2 = set1_epi32(4)
.L13:                         # do {
    add     eax, 1            #   i++
    paddd   xmm0, xmm1        #   total[0..3] += i[0..3]
    paddd   xmm1, xmm2        #   i[0..3] += 4
    cmp     eax, edx
    jne     .L13              # }while( i != upper_limit );

  # then horizontal sum xmm0
  # and peeled cleanup for the last n%4 iterations, or something.
```

It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.

BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient: 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.
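For reference, here is my reading of clang's closed-form asm for the i < n+1 version translated back into C++, next to the plain loop. This is a sketch of what the instructions compute, not clang's actual source-level transformation, and sum_loop / sum_closed_form are names I picked for it.

```cpp
#include <cstdint>

// The "finite" loop from above, with fixed-width types.
constexpr std::uint32_t sum_loop(std::uint32_t n) {
    std::uint32_t total = 0;
    for (std::uint32_t i = 0; i < n + 1; ++i) {
        total += i;
    }
    return total;
}

// Mirrors the cmp/je, lea, imul, shr, add sequence.
constexpr std::uint32_t sum_closed_form(std::uint32_t n) {
    if (n == UINT32_MAX) return 0;  // n + 1 wraps to 0, so the loop runs 0 times
    std::uint64_t wide_n = n;                                      // mov ecx, edi
    std::uint64_t n_minus_1 = static_cast<std::uint32_t>(n - 1);   // lea eax, [rdi - 1]
    std::uint64_t half = (wide_n * n_minus_1) >> 1;                // imul, shr (64-bit)
    return static_cast<std::uint32_t>(half + wide_n);              // add, truncated to 32-bit
}
```

Doing the multiply in 64 bits is the point of the n * (n-1) / 2 + n arrangement: the product can't overflow before the shift, so the truncated result agrees with the wrapping 32-bit loop even when the true sum exceeds 2^32.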