For loop performance difference, and compiler optimization

前端未结

关注

 7  1332

耶瑟儿～ 2021-02-14 00:40

I chose David\'s answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what

7条回答

闹比i (楼主)

2021-02-14 01:17

EDIT: After learning more about dependencies in processor pipelining, I revised my answer, removing some unnecessary details and offering a more concrete explanation of the slowdown.

It appears that the performance difference in the -O0 case is due to processor pipelining.

First, the assembly (for the -O0 build), copied from Nemo's answer, with some of my own comments inline:

superCalculationA(int, int):
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -20(%rbp)    # init
    movl    %esi, -24(%rbp)    # end
    movq    $0, -8(%rbp)       # total = 0
    movl    -20(%rbp), %eax    # copy init to register rax
    movl    %eax, -12(%rbp)    # i = [rax]
    jmp .L7
.L8:
    movl    -12(%rbp), %eax    # copy i to register rax
    cltq
    addq    %rax, -8(%rbp)     # total += [rax]
    addl    $1, -12(%rbp)      # i++
.L7:
    movl    -12(%rbp), %eax    # copy i to register rax
    cmpl    -24(%rbp), %eax    # [rax] < end
    jl  .L8
    movq    -8(%rbp), %rax
    popq    %rbp
    ret

superCalculationB(int, int):
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -20(%rbp)    # init
    movl    %esi, -24(%rbp)    # todo
    movq    $0, -8(%rbp)       # total = 0
    movl    -20(%rbp), %eax    # copy init to register rax
    movl    %eax, -12(%rbp)    # i = [rax]
    jmp .L11
.L12:
    movl    -12(%rbp), %eax    # copy i to register rax
    cltq
    addq    %rax, -8(%rbp)     # total += [rax]
    addl    $1, -12(%rbp)      # i++
.L11:
    movl    -20(%rbp), %edx    # copy init to register rdx
    movl    -24(%rbp), %eax    # copy todo to register rax
    addl    %edx, %eax         # [rax] += [rdx]  (so [rax] = init+todo)
    cmpl    -12(%rbp), %eax    # i < [rax]
    jg  .L12
    movq    -8(%rbp), %rax
    popq    %rbp
    ret

In both functions, the stack layout looks like this:

Addr Content

24   end/todo
20   init
16   
12   i
08   total
04   
00

(Note that total is a 64-bit int and so occupies two 4-byte slots.)

These are the key lines of superCalculationA():

    addl    $1, -12(%rbp)      # i++
.L7:
    movl    -12(%rbp), %eax    # copy i to register rax
    cmpl    -24(%rbp), %eax    # [rax] < end

The stack address -12(%rbp) (which holds the value of i) is written to in the addl instruction, and then it is immediately read in the very next instruction. The read instruction cannot begin until the write has completed. This represents a block in the pipeline, causing superCalculationA() to be slower than superCalculationB().

You might be curious why superCalculationB() doesn't have this same pipeline block. It's really just an artifact of how gcc compiles the code in -O0 and doesn't represent anything fundamentally interesting. Basically, in superCalculationA(), the comparison i is performed by reading i from a register, while in superCalculationB(), the comparison i is performed by reading i from the stack.



To demonstrate that this is just an artifact, let's replace 

for (int i = init; i < end; i++)


with

for (int i = init; end > i; i++)


in superCalculateA(). The generated assembly then looks the same, with just the following change to the key lines:

    addl    $1, -12(%rbp)      # i++
.L7:
    movl    -24(%rbp), %eax    # copy end to register rax
    cmpl    -12(%rbp), %eax    # i < [rax]


Now i is read from the stack, and the pipeline block is gone. Here are the performance numbers after making this change:

=====================================================
Elapsed time: 2.296 s | 2295.812 ms | 2295812.000 us
Elapsed time: 2.368 s | 2367.634 ms | 2367634.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================


It should be noted that this is really a toy example, since we are compiling with -O0. In the real world, we compile with -O2 or -O3. In that case, the compiler orders the instructions in such a way so as to minimize pipeline blocks, and we don't need to worry about whether to write i or end>i.