I wrote a piece of C code to show a point in a discussion about optimizations and branch prediction. Then I noticed even more diverse outcome than I did expect. My goal was to w
With -O1
, gcc-4.7.1 calls unpredictableIfs
only once and resuses the result, since it recognizes that it's a pure function, so the result will be the same every time it's called. (Mine did, verified by looking at the generated assembly.)
With higher optimisation level, the functions are inlined, and the compiler doesn't recognize that it's the same code anymore, so it is run each time a function call appears in the source.
Apart from that, my gcc-4.7.1 deals best with unpredictableIfs
when using -O1
or -O2
(apart from the reuse issue, both produce the same code), while noIfs
is treated much better with -O3
. The timings between the different runs of the same code are however consistent here - equal or differing by 10 milliseconds (granularity of clock
), so I have no idea what could cause the substantially different times for unpredictableIfs
you reported for -O3
.
With -O2
, the loop for unpredictableIfs
is identical to the code generated with -O1
(except for register swapping):
.L12:
movl %eax, %ecx
andl $1073741826, %ecx
cmpl $1, %ecx
adcl $0, %edx
addl $1, %eax
cmpl $1000000000, %eax
jne .L12
and for noIfs
it's similar:
.L15:
xorl %ecx, %ecx
testl $1073741826, %eax
sete %cl
addl $1, %eax
addl %ecx, %edx
cmpl $1000000000, %eax
jne .L15
where it was
.L7:
testl $1073741826, %edx
sete %cl
movzbl %cl, %ecx
addl %ecx, %eax
addl $1, %edx
cmpl $1000000000, %edx
jne .L7
with -O1
. Both loops run in similar time, with unpredictableIfs
a bit faster.
With -O3
, the loop for unpredictableIfs
becomes worse,
.L14:
leal 1(%rdx), %ecx
testl $1073741826, %eax
cmove %ecx, %edx
addl $1, %eax
cmpl $1000000000, %eax
jne .L14
and for noIfs
(including the setup-code here), it becomes better:
pxor %xmm2, %xmm2
movq %rax, 32(%rsp)
movdqa .LC3(%rip), %xmm6
xorl %eax, %eax
movdqa .LC2(%rip), %xmm1
movdqa %xmm2, %xmm3
movdqa .LC4(%rip), %xmm5
movdqa .LC5(%rip), %xmm4
.p2align 4,,10
.p2align 3
.L18:
movdqa %xmm1, %xmm0
addl $1, %eax
paffffd %xmm6, %xmm1
cmpl $250000000, %eax
pand %xmm5, %xmm0
pcmpeqd %xmm3, %xmm0
pand %xmm4, %xmm0
paffffd %xmm0, %xmm2
jne .L18
.LC2:
.long 0
.long 1
.long 2
.long 3
.align 16
.LC3:
.long 4
.long 4
.long 4
.long 4
.align 16
.LC4:
.long 1073741826
.long 1073741826
.long 1073741826
.long 1073741826
.align 16
.LC5:
.long 1
.long 1
.long 1
.long 1
it computes four iterations at once, and accordingly, noIfs
runs almost four times as fast then.