Is if( a < 901 )
faster than if( a <= 900 )
.
Not exactly as in this simple example, but there are slight performance changes on loop
When I wrote the first version of this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901
vs. a <= 900
. Many compilers always shrink the magnitude of constants by converting between <
and <=
, e.g. because x86 immediate operand have a shorter 1-byte encoding for -128..127.
For ARM, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp r0, #0x00f000
would be encodeable, while cmp r0, #0x00efff
would not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for ARM. AArch64 is either shift-by-12 or not, instead of an arbitrary rotation, for instructions like cmp
and cmn
, unlike 32-bit ARM and Thumb modes.
In assembly language on most machines, a comparison for <=
has the same cost as a comparison for <
. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.
But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <=
can accidentally create something the compiler can't optimize.
As a loop condition, there are cases where <=
is qualitatively different from <
, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.
Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size
will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)
void foo(unsigned size) {
unsigned upper_bound = size - 1; // or any calculation that could produce UINT_MAX
for(unsigned i=0 ; i <= upper_bound ; i++)
...
Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.
(A simple i <= size
would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)
In this case, size=0
leads to upper_bound=UINT_MAX
, and i <= UINT_MAX
is always true. So this loop is infinite for size=0
, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size
.
Asm like if(!size) skip the loop;
do{...}while(--size);
is one normally-efficient way to optimize a for( i
i
isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).
But that do{}while can't be infinite: if entered with size==0
, we get 2^n iterations. (Iterating over all unsigned integers in a for loop C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)
With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.
Using unsigned i <= n
defeats clang's idiom-recognition that optimizes sum(1 .. n)
loops with a closed form based on Gauss's n * (n+1) / 2
formula.
unsigned sum_1_to_n_finite(unsigned n) {
unsigned total = 0;
for (unsigned i = 0 ; i < n+1 ; ++i)
total += i;
return total;
}
x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer
# clang7.0 -O3 closed-form
cmp edi, -1 # n passed in EDI: x86-64 System V calling convention
je .LBB1_1 # if (n == UINT_MAX) return 0; // C++ loop runs 0 times
# else fall through into the closed-form calc
mov ecx, edi # zero-extend n into RCX
lea eax, [rdi - 1] # n-1
imul rax, rcx # n * (n-1) # 64-bit
shr rax # n * (n-1) / 2
add eax, edi # n + (stuff / 2) = n * (n+1) / 2 # truncated to 32-bit
ret # computed without possible overflow of the product before right shifting
.LBB1_1:
xor eax, eax
ret
But for the naive version, we just get a dumb loop from clang.
unsigned sum_1_to_n_naive(unsigned n) {
unsigned total = 0;
for (unsigned i = 0 ; i<=n ; ++i)
total += i;
return total;
}
# clang7.0 -O3
sum_1_to_n(unsigned int):
xor ecx, ecx # i = 0
xor eax, eax # retval = 0
.LBB0_1: # do {
add eax, ecx # retval += i
add ecx, 1 # ++1
cmp ecx, edi
jbe .LBB0_1 # } while( i
GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i
values in parallel in the elements of an XMM register.
# "naive" inner loop
.L3:
add eax, 1 # do {
paffffd xmm0, xmm1 # vect_total_4.6, vect_vec_iv_.5
paffffd xmm1, xmm2 # vect_vec_iv_.5, tmp114
cmp edx, eax # bnd.1, ivtmp.14 # bound and induction-variable tmp, I think.
ja .L3 #, # }while( n > i )
"finite" inner loop
# before the loop:
# xmm0 = 0 = totals
# xmm1 = {0,1,2,3} = i
# xmm2 = set1_epi32(4)
.L13: # do {
add eax, 1 # i++
paffffd xmm0, xmm1 # total[0..3] += i[0..3]
paffffd xmm1, xmm2 # i[0..3] += 4
cmp eax, edx
jne .L13 # }while( i != upper_limit );
then horizontal sum xmm0
and peeled cleanup for the last n%3 iterations, or something.
It also has a plain scalar loop which I think it uses for very small n
, and/or for the infinite loop case.
BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1
/jnz
instead of add eax,1
/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.