Is < faster than <=?

孤城傲影 2020-11-22 13:43

Is if( a < 901 ) faster than if( a <= 900 )?

Not exactly as in this simple example, but there are slight performance changes in loop-heavy code.

14 answers
  • 2020-11-22 14:16

    When I wrote the first version of this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operands have a shorter 1-byte encoding for -128..127.
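
    For example, a minimal sketch of what that canonicalization buys on x86 (my addition; compilers are free to emit either form):

    bool le127(int a) { return a <= 127; }  // can compile to cmp edi, 127: fits a sign-extended imm8
    bool lt128(int a) { return a < 128;  }  // may be rewritten as a <= 127 to avoid a 4-byte immediate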

    For ARM, being able to encode a comparison constant as an immediate depends on being able to rotate an 8-bit field by an even count into any position in a word. So cmp r0, #0x00f000 would be encodeable, while cmp r0, #0x00efff would not be. So the make-it-smaller rule for comparison against a compile-time constant doesn't always apply for ARM. AArch64's immediates for instructions like cmp and cmn are a 12-bit field optionally shifted by 12, rather than an arbitrary rotation as in 32-bit ARM and Thumb modes.
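
    As a sanity check on that rule, here's a minimal sketch (my addition; is_arm_immediate is a hypothetical helper, not a real library function) of the 32-bit ARM encodeability test, using the fact that an immediate is an 8-bit value rotated right by an even amount:

    #include <cstdint>
    #include <cstdio>

    static bool is_arm_immediate(uint32_t x) {
        // encodeable iff some even left-rotation of x fits in 8 bits
        for (unsigned r = 0; r < 32; r += 2) {
            uint32_t rotl = (r == 0) ? x : (x << r) | (x >> (32 - r));
            if (rotl <= 0xFFu) return true;
        }
        return false;
    }

    int main() {
        std::printf("%d\n", is_arm_immediate(0x00f000));  // 1: encodeable
        std::printf("%d\n", is_arm_immediate(0x00efff));  // 0: not encodeable
    }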


    < vs. <= in general, including for runtime-variable conditions

    In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.
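
    A minimal sketch of those three uses (my addition); in each, swapping < for <= only changes the condition code (e.g. x86 setl vs. setle, cmovl vs. cmovle, jl vs. jle), not the instruction count:

    extern void f();

    bool as_bool(int a, int b)   { return a <= b; }          // cmp + setle
    int  as_select(int a, int b) { return a <= b ? a : b; }  // cmp + cmovle
    void as_branch(int a, int b) { if (a <= b) f(); }        // cmp + branch over the call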

    But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.

    As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.

    Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)

    void foo(unsigned size) {
        unsigned upper_bound = size - 1;  // or any calculation that could produce UINT_MAX
        for(unsigned i=0 ; i <= upper_bound ; i++) {
            // ... loop body ...
        }
    }

    Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.

    (A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)

    In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.
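
    A minimal sketch of the safer spelling (my addition): compare i < size directly, so size == 0 runs zero iterations and no wraparound is possible for the compiler to worry about.

    void foo_safe(unsigned size) {
        for (unsigned i = 0 ; i < size ; i++) {
            // ... loop body ...
        }
    }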

    Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).
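
    Written back in C, that rotated shape looks like this (my addition, a sketch of what the compiler emits rather than something you'd write by hand):

    void foo_rotated(unsigned size) {
        if (size == 0) return;      // if(!size) skip the loop
        do {
            // ... loop body ...
        } while (--size != 0);      // decrement sets flags itself: no separate cmp
    }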

    But that do{}while structure can't reproduce the infinite loop: if entered with size==0, it would run 2^n iterations and then stop. (Iterating over all unsigned integers in a for loop: C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)
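
    The pattern from that linked question, as a minimal sketch (my addition): increment after the body and stop when the counter wraps back to zero, which visits every unsigned value including UINT_MAX exactly once.

    void visit_all_unsigned(void (*visit)(unsigned)) {
        unsigned i = 0;
        do {
            visit(i);
        } while (++i != 0);   // wraps to 0 only after i == UINT_MAX
    }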

    With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.

    Example: sum of integers from 1 to n

    Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.

    unsigned sum_1_to_n_finite(unsigned n) {
        unsigned total = 0;
        for (unsigned i = 0 ; i < n+1 ; ++i)
            total += i;
        return total;
    }
    

    x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer

     # clang7.0 -O3 closed-form
        cmp     edi, -1       # n passed in EDI: x86-64 System V calling convention
        je      .LBB1_1       # if (n == UINT_MAX) return 0;  // C++ loop runs 0 times
              # else fall through into the closed-form calc
        mov     ecx, edi         # zero-extend n into RCX
        lea     eax, [rdi - 1]   # n-1
        imul    rax, rcx         # n * (n-1)             # 64-bit
        shr     rax              # n * (n-1) / 2
        add     eax, edi         # n + (stuff / 2) = n * (n+1) / 2   # truncated to 32-bit
        ret          # computed without possible overflow of the product before right shifting
    .LBB1_1:
        xor     eax, eax
        ret
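
    Translated back into C++ (my addition, a sketch of what clang's asm above computes, not clang's actual source-level transform):

    #include <cstdint>

    unsigned sum_1_to_n_closed(unsigned n) {
        if (n + 1 == 0)                         // n == UINT_MAX: the C++ loop runs 0 times
            return 0;
        uint64_t prod = (uint64_t)n * (n - 1);  // 64-bit product can't overflow before the shift
        return (unsigned)(prod >> 1) + n;       // n*(n-1)/2 + n == n*(n+1)/2
    }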
    

    But for the naive version, we just get a dumb loop from clang.

    unsigned sum_1_to_n_naive(unsigned n) {
        unsigned total = 0;
        for (unsigned i = 0 ; i<=n ; ++i)
            total += i;
        return total;
    }
    
    # clang7.0 -O3
    sum_1_to_n_naive(unsigned int):
        xor     ecx, ecx           # i = 0
        xor     eax, eax           # retval = 0
    .LBB0_1:                       # do {
        add     eax, ecx             # retval += i
        add     ecx, 1               # ++i
        cmp     ecx, edi
        jbe     .LBB0_1            # } while( i<=n );
        ret
    

    GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.

    # "naive" inner loop
    .L3:
        add     eax, 1       # do {
        paddd   xmm0, xmm1    # vect_total_4.6, vect_vec_iv_.5
        paddd   xmm1, xmm2    # vect_vec_iv_.5, tmp114
        cmp     edx, eax      # bnd.1, ivtmp.14     # bound and induction-variable tmp, I think.
        ja      .L3           # }while( n > i )
    
     "finite" inner loop
      # before the loop:
      # xmm0 = 0 = totals
      # xmm1 = {0,1,2,3} = i
      # xmm2 = set1_epi32(4)
     .L13:                # do {
        add     eax, 1       # i++
        paddd   xmm0, xmm1    # total[0..3] += i[0..3]
        paddd   xmm1, xmm2    # i[0..3] += 4
        cmp     eax, edx
        jne     .L13      # }while( i != upper_limit );
    
         # then horizontal sum xmm0
         # and peeled cleanup for the last n%4 iterations, or something.
         
    

    It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.
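
    For reference, here's that 4-wide strategy hand-written with SSE2 intrinsics (my addition; a sketch that assumes n+1 is a multiple of 4, so it omits the scalar cleanup GCC actually generates):

    #include <emmintrin.h>

    unsigned sum_1_to_n_simd(unsigned n) {
        __m128i totals = _mm_setzero_si128();          // total[0..3] = 0
        __m128i iv     = _mm_set_epi32(3, 2, 1, 0);    // i[0..3] = {0,1,2,3}
        __m128i step   = _mm_set1_epi32(4);
        for (unsigned i = 0 ; i < n + 1 ; i += 4) {
            totals = _mm_add_epi32(totals, iv);        // total[0..3] += i[0..3]
            iv     = _mm_add_epi32(iv, step);          // i[0..3] += 4
        }
        // horizontal sum of the four 32-bit lanes
        totals = _mm_add_epi32(totals, _mm_shuffle_epi32(totals, _MM_SHUFFLE(1,0,3,2)));
        totals = _mm_add_epi32(totals, _mm_shuffle_epi32(totals, _MM_SHUFFLE(2,3,0,1)));
        return (unsigned)_mm_cvtsi128_si32(totals);
    }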

    BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.
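
    One source-level way to encourage that (my addition, a sketch; whether the compiler actually emits sub/jnz depends on the target and version): count the loop variable down toward zero so the decrement itself sets the flags.

    unsigned sum_1_to_n_countdown(unsigned n) {
        unsigned total = 0;
        for (unsigned i = n ; i != 0 ; --i)   // dec + jnz, no separate cmp
            total += i;
        return total;                         // sum 1..n == sum 0..n
    }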

  • 2020-11-22 14:20

    Assuming we're talking about built-in integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.

    If there were some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).
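
    That conversion is trivially valid, as a minimal sketch shows (my addition): for any int constant c below INT_MAX, a <= c and a < c+1 are the same predicate, so the backend can pick whichever constant encodes better.

    bool le(int a) { return a <= 900; }   // interchangeable with...
    bool lt(int a) { return a < 901; }    // ...this, for every int a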
