I chose David\'s answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what
EDIT: After learning more about dependencies in processor pipelining, I revised my answer, removing some unnecessary details and offering a more concrete explanation of the slowdown.
It appears that the performance difference in the -O0 case is due to processor pipelining.
First, the assembly (for the -O0 build), copied from Nemo's answer, with some of my own comments inline:
superCalculationA(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # init
movl %esi, -24(%rbp) # end
movq $0, -8(%rbp) # total = 0
movl -20(%rbp), %eax # copy init to register rax
movl %eax, -12(%rbp) # i = [rax]
jmp .L7
.L8:
movl -12(%rbp), %eax # copy i to register rax
cltq
addq %rax, -8(%rbp) # total += [rax]
addl $1, -12(%rbp) # i++
.L7:
movl -12(%rbp), %eax # copy i to register rax
cmpl -24(%rbp), %eax # [rax] < end
jl .L8
movq -8(%rbp), %rax
popq %rbp
ret
superCalculationB(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # init
movl %esi, -24(%rbp) # todo
movq $0, -8(%rbp) # total = 0
movl -20(%rbp), %eax # copy init to register rax
movl %eax, -12(%rbp) # i = [rax]
jmp .L11
.L12:
movl -12(%rbp), %eax # copy i to register rax
cltq
addq %rax, -8(%rbp) # total += [rax]
addl $1, -12(%rbp) # i++
.L11:
movl -20(%rbp), %edx # copy init to register rdx
movl -24(%rbp), %eax # copy todo to register rax
addl %edx, %eax # [rax] += [rdx] (so [rax] = init+todo)
cmpl -12(%rbp), %eax # i < [rax]
jg .L12
movq -8(%rbp), %rax
popq %rbp
ret
In both functions, the stack layout looks like this:
Addr Content
24 end/todo
20 init
16
12 i
08 total
04
00
(Note that total
is a 64-bit int and so occupies two 4-byte slots.)
These are the key lines of superCalculationA()
:
addl $1, -12(%rbp) # i++
.L7:
movl -12(%rbp), %eax # copy i to register rax
cmpl -24(%rbp), %eax # [rax] < end
The stack address -12(%rbp)
(which holds the value of i
) is written to in the addl
instruction, and then it is immediately read in the very next instruction. The read instruction cannot begin until the write has completed. This represents a block in the pipeline, causing superCalculationA()
to be slower than superCalculationB()
.
You might be curious why superCalculationB()
doesn't have this same pipeline block. It's really just an artifact of how gcc compiles the code in -O0 and doesn't represent anything fundamentally interesting. Basically, in superCalculationA()
, the comparison i
i
from a register, while in superCalculationB()
, the comparison i
i
from the stack.
To demonstrate that this is just an artifact, let's replace
for (int i = init; i < end; i++)
with
for (int i = init; end > i; i++)
in superCalculateA()
. The generated assembly then looks the same, with just the following change to the key lines:
addl $1, -12(%rbp) # i++
.L7:
movl -24(%rbp), %eax # copy end to register rax
cmpl -12(%rbp), %eax # i < [rax]
Now i
is read from the stack, and the pipeline block is gone. Here are the performance numbers after making this change:
=====================================================
Elapsed time: 2.296 s | 2295.812 ms | 2295812.000 us
Elapsed time: 2.368 s | 2367.634 ms | 2367634.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
It should be noted that this is really a toy example, since we are compiling with -O0. In the real world, we compile with -O2 or -O3. In that case, the compiler orders the instructions in such a way so as to minimize pipeline blocks, and we don't need to worry about whether to write i
end>i
.