For loop performance difference, and compiler optimization

耶瑟儿~ 2021-02-14 00:40

I chose David's answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what

7 answers
  •  闹比i (original poster)
    2021-02-14 01:17

    EDIT: After learning more about dependencies in processor pipelining, I revised my answer, removing some unnecessary details and offering a more concrete explanation of the slowdown.


    It appears that the performance difference in the -O0 case is due to processor pipelining.

    First, the assembly (for the -O0 build), copied from Nemo's answer, with some of my own comments inline:

    superCalculationA(int, int):
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %edi, -20(%rbp)    # init
        movl    %esi, -24(%rbp)    # end
        movq    $0, -8(%rbp)       # total = 0
        movl    -20(%rbp), %eax    # copy init to register rax
        movl    %eax, -12(%rbp)    # i = [rax]
        jmp .L7
    .L8:
        movl    -12(%rbp), %eax    # copy i to register rax
        cltq
        addq    %rax, -8(%rbp)     # total += [rax]
        addl    $1, -12(%rbp)      # i++
    .L7:
        movl    -12(%rbp), %eax    # copy i to register rax
        cmpl    -24(%rbp), %eax    # [rax] < end
        jl  .L8
        movq    -8(%rbp), %rax
        popq    %rbp
        ret
    
    superCalculationB(int, int):
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %edi, -20(%rbp)    # init
        movl    %esi, -24(%rbp)    # todo
        movq    $0, -8(%rbp)       # total = 0
        movl    -20(%rbp), %eax    # copy init to register rax
        movl    %eax, -12(%rbp)    # i = [rax]
        jmp .L11
    .L12:
        movl    -12(%rbp), %eax    # copy i to register rax
        cltq
        addq    %rax, -8(%rbp)     # total += [rax]
        addl    $1, -12(%rbp)      # i++
    .L11:
        movl    -20(%rbp), %edx    # copy init to register rdx
        movl    -24(%rbp), %eax    # copy todo to register rax
        addl    %edx, %eax         # [rax] += [rdx]  (so [rax] = init+todo)
        cmpl    -12(%rbp), %eax    # i < [rax]
        jg  .L12
        movq    -8(%rbp), %rax
        popq    %rbp
        ret
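
    For reference, C++ source along the following lines would compile to a listing like the one above. It is reconstructed from the assembly and its comments, not copied from the question: the 64-bit type of total is inferred from the addq/cltq instructions and is assumed here to be long long, so treat this as a sketch rather than the asker's exact code.

```cpp
// Reconstructed from the -O0 assembly above (an assumption, not the
// asker's verbatim source). total is 64-bit; i, init, end, todo are int.

long long superCalculationA(int init, int end) {
    long long total = 0;
    for (int i = init; i < end; i++) {
        total += i;   // i is sign-extended to 64 bits (cltq) before the add
    }
    return total;
}

long long superCalculationB(int init, int todo) {
    long long total = 0;
    // At -O0 the bound init + todo is recomputed on every iteration.
    for (int i = init; i < init + todo; i++) {
        total += i;
    }
    return total;
}
```

    With init = 0, calling superCalculationA(0, n) and superCalculationB(0, n) sums the same range, which is why the two are comparable in a benchmark.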
    

    In both functions, the stack layout looks like this:

    Addr Content
    
    24   end/todo
    20   init
    16   
    12   i
    08   total
    04   
    00   
    

    (The addresses are byte offsets below %rbp, so 24 refers to -24(%rbp). Note that total is a 64-bit int and so occupies two 4-byte slots.)

    These are the key lines of superCalculationA():

        addl    $1, -12(%rbp)      # i++
    .L7:
        movl    -12(%rbp), %eax    # copy i to register rax
        cmpl    -24(%rbp), %eax    # [rax] < end
    

    The stack address -12(%rbp) (which holds the value of i) is written to in the addl instruction, and then it is immediately read in the very next instruction. The read instruction cannot begin until the write has completed. This represents a block in the pipeline, causing superCalculationA() to be slower than superCalculationB().
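
    To make this store-then-reload pattern visible at the source level, the sketch below declares the loop counter volatile. That obliges the compiler to emit a real memory load or store for every access to i, even with optimization enabled, much as -O0 does for every local variable. The name sumWithMemoryCounter is hypothetical and does not appear in the question; the sketch only reproduces the access pattern and makes no claim about timings on any particular CPU.

```cpp
// Hypothetical helper, not part of the original question.
// The volatile counter lives in memory, so each iteration stores i and
// then has to load it back for the next comparison.
long long sumWithMemoryCounter(int init, int end) {
    long long total = 0;
    volatile int i = init;   // every read and write of i goes through memory
    while (i < end) {        // load i for the comparison
        total += i;          // load i again for the addition
        i = i + 1;           // load i, add 1, store i back
    }
    return total;
}
```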

    You might be curious why superCalculationB() doesn't have this same pipeline block. It's really just an artifact of how gcc compiles the code in -O0 and doesn't represent anything fundamentally interesting. Basically, in superCalculationA(), the comparison i<end is performed by reading i from a register, while in superCalculationB(), the comparison i<init+todo is performed by reading i from the stack.

    To demonstrate that this is just an artifact, let's replace

    for (int i = init; i < end; i++)
    

    with

    for (int i = init; end > i; i++)
    

    in superCalculationA(). The generated assembly then looks the same, with just the following change to the key lines:

        addl    $1, -12(%rbp)      # i++
    .L7:
        movl    -24(%rbp), %eax    # copy end to register rax
        cmpl    -12(%rbp), %eax    # i < [rax]
    

    Now i is read from the stack, and the pipeline block is gone. Here are the performance numbers after making this change:

    =====================================================
    Elapsed time: 2.296 s | 2295.812 ms | 2295812.000 us
    Elapsed time: 2.368 s | 2367.634 ms | 2367634.000 us
    The first method, i.e. superCalculationA, succeeded.
    The second method, i.e. superCalculationB, succeeded.
    =====================================================
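
    A comparison like this could be timed with a harness along the following lines. It is a sketch, not the benchmark that produced the numbers above: the helper timeSeconds, the name superCalculationAReversed, and the choice of std::chrono::steady_clock are all assumptions made for illustration.

```cpp
#include <chrono>

// superCalculationA with the loop condition written as end > i
// (the name is hypothetical; the loop body is unchanged).
long long superCalculationAReversed(int init, int end) {
    long long total = 0;
    for (int i = init; end > i; i++) {
        total += i;
    }
    return total;
}

// Hypothetical timing helper: calls f(a, b), stores its result in out,
// and returns the elapsed wall-clock time in seconds.
template <typename F>
double timeSeconds(F f, int a, int b, long long& out) {
    auto start = std::chrono::steady_clock::now();
    out = f(a, b);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

    Calling timeSeconds(superCalculationAReversed, init, end, result) and printing the returned value gives one "Elapsed time" figure. The loop bounds behind the numbers above are not stated in this answer, and the effect is only expected when building with -O0.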
    

    It should be noted that this is really a toy example, since we are compiling with -O0. In the real world, we compile with -O2 or -O3. In that case, the compiler orders the instructions so as to minimize pipeline blocks, and we don't need to worry about whether to write i<end or end>i.
