I chose David's answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what happens when optimization flags are enabled.
(This is not exactly an answer, but it does include more data, including some that conflicts with Jerry Coffin's.)
The interesting question is why the unoptimized routines perform so differently and counter-intuitively. The `-O2` and `-O3` cases are relatively simple to explain, and others have done so.
For completeness, here is the assembly (thanks @Rutan Kax) for `superCalculationA` and `superCalculationB` produced by GCC 4.9.1:
superCalculationA(int, int):
    pushq %rbp
    movq %rsp, %rbp
    movl %edi, -20(%rbp)      # first argument
    movl %esi, -24(%rbp)      # second argument (loop bound)
    movq $0, -8(%rbp)         # total = 0
    movl -20(%rbp), %eax
    movl %eax, -12(%rbp)      # i = first argument
    jmp .L7
.L8:
    movl -12(%rbp), %eax
    cltq
    addq %rax, -8(%rbp)       # total += i
    addl $1, -12(%rbp)        # i++
.L7:
    movl -12(%rbp), %eax
    cmpl -24(%rbp), %eax      # compare i with the precomputed bound
    jl .L8
    movq -8(%rbp), %rax
    popq %rbp
    ret
superCalculationB(int, int):
    pushq %rbp
    movq %rsp, %rbp
    movl %edi, -20(%rbp)      # first argument
    movl %esi, -24(%rbp)      # second argument
    movq $0, -8(%rbp)         # total = 0
    movl -20(%rbp), %eax
    movl %eax, -12(%rbp)      # i = first argument
    jmp .L11
.L12:
    movl -12(%rbp), %eax
    cltq
    addq %rax, -8(%rbp)       # total += i
    addl $1, -12(%rbp)        # i++
.L11:
    movl -20(%rbp), %edx
    movl -24(%rbp), %eax
    addl %edx, %eax           # bound = arg1 + arg2, recomputed every iteration
    cmpl -12(%rbp), %eax      # compare the bound with i
    jg .L12
    movq -8(%rbp), %rax
    popq %rbp
    ret
It sure looks to me like B is doing more work.
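For reference, loops of roughly this shape produce the assembly above. This is a sketch, not the question's exact source; the parameter names `init`, `end`, and `todo` are my own labels for the two arguments:

```cpp
#include <cstdint>

// Sketch of the two routines. A receives its end bound precomputed,
// so the loop condition is a single compare. B writes the bound as
// init + todo in the condition, and at -O0 GCC re-evaluates that sum
// on every iteration, which is the extra work visible in its assembly.
int64_t superCalculationA(int init, int end) {
    int64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

int64_t superCalculationB(int init, int todo) {
    int64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}
```

With any optimization enabled, the compiler hoists `init + todo` out of the loop and the two routines compile to essentially the same code; the difference only shows up unoptimized.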
My test platform is a 2.9 GHz Sandy Bridge EP processor (E5-2690) running Red Hat Enterprise Linux 6 Update 3. My compiler is GCC 4.9.1 and produces the assembly above.
To make sure Turbo Boost and related CPU-frequency-diddling technologies are not interfering with the measurement, I ran:
pkill cpuspeed # if you have it running
grep MHz /proc/cpuinfo # to see where you start
modprobe acpi_cpufreq # if you do not have it loaded
cd /sys/devices/system/cpu
for cpuN in cpu[0-9]* ; do
    echo userspace > $cpuN/cpufreq/scaling_governor
    echo 2000000 > $cpuN/cpufreq/scaling_setspeed
done
grep MHz /proc/cpuinfo # to see if it worked
This pins the CPU frequency to 2.0 GHz and disables Turbo Boost.
Jerry observed these two routines running faster or slower depending on the order in which he executed them. I could not reproduce that result. For me, `superCalculationB` consistently runs 25-30% faster than `superCalculationA`, regardless of the Turbo Boost or clock speed settings. That includes running them multiple times in arbitrary order. For example, at 2.0 GHz `superCalculationA` consistently takes a little over 4500ms and `superCalculationB` consistently takes a little under 3600ms.
I have yet to see any theory that even begins to explain this.