Is < faster than <=?

孤城傲影 2020-11-22 13:43

Is if( a < 901 ) faster than if( a <= 900 )?

Not exactly as in this simple example, but there are slight performance changes in loop code.

14 Answers
  •  情话喂你
    2020-11-22 14:16

    When I wrote the first version of this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operands have a shorter 1-byte encoding for -128..127.
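
    As a minimal sketch of that constant-shrinking (the function names are hypothetical, and the asm comments show what gcc/clang typically emit for x86-64, expected rather than guaranteed output):

    bool lt128(int x) { return x < 128; }   // usually canonicalized to x <= 127:
                                            //   cmp edi, 127 (1-byte imm8) / setle al
    bool le127(int x) { return x <= 127; }  // same asm: cmp edi, 127 / setle al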

    For ARM, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp r0, #0x00f000 would be encodeable, while cmp r0, #0x00efff would not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for ARM. AArch64 is either shift-by-12 or not, instead of an arbitrary rotation, for instructions like cmp and cmn, unlike 32-bit ARM and Thumb modes.


    < vs. <= in general, including for runtime-variable conditions

    In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.
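
    As a sketch of that equivalence (hypothetical function names; the asm comments show what clang/gcc typically emit for x86-64, and the exact form varies):

    bool lt(int a, int b) { return a <  b; }            // cmp edi, esi / setl  al
    bool le(int a, int b) { return a <= b; }            // cmp edi, esi / setle al  (same cost)
    int  min2(int a, int b) { return a <= b ? a : b; }  // cmp + cmov or similar: branchless
                                                        // select, same cost either way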

    But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.

    As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.

    Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)
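
    For example (a sketch of mine, not from the original answer): with a signed counter, the compiler may assume ++i never wraps past INT_MAX, because signed overflow is UB, so it can treat this loop as finite for every input, even size == INT_MAX:

    long sum_signed(int size) {
        long total = 0;
        for (int i = 1; i <= size; ++i)  // compiler may assume i <= size
            total += i;                  // eventually becomes false
        return total;
    }

    The unsigned version below has no such escape hatch: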

    void foo(unsigned size) {
        unsigned upper_bound = size - 1;  // or any calculation that could produce UINT_MAX
        for(unsigned i=0 ; i <= upper_bound ; i++)
            ...
    }

    Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.

    (A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)

    In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.
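
    A sketch of how you might rewrite the loop so the compiler can prove termination for every input (foo_fixed and do_something are hypothetical names of mine):

    void do_something(unsigned);             // hypothetical loop body

    void foo_fixed(unsigned size) {
        for (unsigned i = 0; i < size; i++)  // i == size is reachable without
            do_something(i);                 // wraparound, so never infinite
    }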

    Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i < size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).

    But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (Iterating over all unsigned integers in a for loop: C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)
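
    If you actually do want every unsigned value including UINT_MAX, one way to express it in C without a wrapping loop condition is to test the old value of i after the body, do{}while style (a sketch of mine; visit is a hypothetical callback):

    #include <limits.h>

    void iterate_all(void (*visit)(unsigned)) {
        unsigned i = 0;
        do {
            visit(i);
        } while (i++ != UINT_MAX);  // exits after the i == UINT_MAX iteration
    }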

    With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.

    Example: sum of integers from 1 to n

    Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.

    unsigned sum_1_to_n_finite(unsigned n) {
        unsigned total = 0;
        for (unsigned i = 0 ; i < n+1 ; ++i)
            total += i;
        return total;
    }
    

    x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer

     # clang7.0 -O3 closed-form
        cmp     edi, -1       # n passed in EDI: x86-64 System V calling convention
        je      .LBB1_1       # if (n == UINT_MAX) return 0;  // C++ loop runs 0 times
              # else fall through into the closed-form calc
        mov     ecx, edi         # zero-extend n into RCX
        lea     eax, [rdi - 1]   # n-1
        imul    rax, rcx         # n * (n-1)             # 64-bit
        shr     rax              # n * (n-1) / 2
        add     eax, edi         # n + (stuff / 2) = n * (n+1) / 2   # truncated to 32-bit
        ret          # computed without possible overflow of the product before right shifting
    .LBB1_1:
        xor     eax, eax
        ret
    
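    For reference, here's a C rendering of what that closed-form asm computes (my sketch for illustration, not clang output): widen to 64 bits so n * (n-1) can't overflow before the shift, and special-case n == UINT_MAX, where the C++ loop runs zero times because n+1 wraps to 0.

    #include <limits.h>
    #include <stdint.h>

    unsigned sum_1_to_n_closed(unsigned n) {
        if (n == UINT_MAX) return 0;             // loop runs 0 times: n+1 wraps to 0
        uint64_t prod = (uint64_t)n * (n - 1);   // imul rax, rcx  (64-bit, can't overflow)
        return (unsigned)(prod >> 1) + n;        // shr rax ; add eax, edi
    }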

    But for the naive version, we just get a dumb loop from clang.

    unsigned sum_1_to_n_naive(unsigned n) {
        unsigned total = 0;
        for (unsigned i = 0 ; i<=n ; ++i)
            total += i;
        return total;
    }
    
    # clang7.0 -O3
    sum_1_to_n_naive(unsigned int):
        xor     ecx, ecx           # i = 0
        xor     eax, eax           # retval = 0
    .LBB0_1:                       # do {
        add     eax, ecx             # retval += i
        add     ecx, 1               # ++i
        cmp     ecx, edi
        jbe     .LBB0_1            # } while( i <= n );
        ret

    GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.

    # "naive" inner loop
    .L3:
        add     eax, 1        # do {
        paddd   xmm0, xmm1    # vect_total_4.6, vect_vec_iv_.5
        paddd   xmm1, xmm2    # vect_vec_iv_.5, tmp114
        cmp     edx, eax      # bnd.1, ivtmp.14     # bound and induction-variable tmp, I think.
        ja      .L3           # }while( n > i )

     # "finite" inner loop
      # before the loop:
      # xmm0 = 0 = totals
      # xmm1 = {0,1,2,3} = i
      # xmm2 = set1_epi32(4)
     .L13:                   # do {
        add     eax, 1       # i++
        paddd   xmm0, xmm1   # total[0..3] += i[0..3]
        paddd   xmm1, xmm2   # i[0..3] += 4
        cmp     eax, edx
        jne     .L13         # }while( i != upper_limit );

         # then horizontal sum xmm0
         # and peeled cleanup for the last n%4 iterations, or something.
         
    

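    In C terms, GCC's vectorized loop plus its scalar cleanup corresponds roughly to this SSE2-intrinsics sketch (my illustration, not GCC's actual output, assuming n comfortably below UINT_MAX so i + 3 can't wrap):

    #include <emmintrin.h>  // SSE2 intrinsics

    unsigned sum_1_to_n_simd(unsigned n) {
        __m128i totals = _mm_setzero_si128();         // {0,0,0,0}
        __m128i iv     = _mm_setr_epi32(0, 1, 2, 3);  // i[0..3]
        const __m128i step = _mm_set1_epi32(4);
        unsigned i = 0;
        for (; i + 3 <= n; i += 4) {                  // 4 counter values per iteration
            totals = _mm_add_epi32(totals, iv);       // paddd: total[0..3] += i[0..3]
            iv     = _mm_add_epi32(iv, step);         // paddd: i[0..3] += 4
        }
        unsigned lane[4];                             // horizontal sum of the 4 lanes
        _mm_storeu_si128((__m128i*)lane, totals);
        unsigned total = lane[0] + lane[1] + lane[2] + lane[3];
        for (; i <= n; ++i)                           // peeled scalar cleanup for the
            total += i;                               // leftover iterations
        return total;
    }
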
    It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.

    BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.
