For loop performance difference, and compiler optimization

耶瑟儿~ 2021-02-14 00:40

I chose David's answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what

7 answers
  •  情歌与酒
    2021-02-14 01:28

    (This is not exactly an answer, but it does include more data, including some that conflicts with Jerry Coffin's.)

    The interesting question is why the unoptimized routines perform so differently and counter-intuitively. The -O2 and -O3 cases are relatively simple to explain, and others have done so.

    For completeness, here is the assembly (thanks @Rutan Kax) for superCalculationA and superCalculationB produced by GCC 4.9.1:

    superCalculationA(int, int):
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %edi, -20(%rbp)
        movl    %esi, -24(%rbp)
        movq    $0, -8(%rbp)
        movl    -20(%rbp), %eax
        movl    %eax, -12(%rbp)
        jmp .L7
    .L8:
        movl    -12(%rbp), %eax
        cltq
        addq    %rax, -8(%rbp)
        addl    $1, -12(%rbp)
    .L7:
        movl    -12(%rbp), %eax
        cmpl    -24(%rbp), %eax
        jl  .L8
        movq    -8(%rbp), %rax
        popq    %rbp
        ret
    
    superCalculationB(int, int):
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %edi, -20(%rbp)
        movl    %esi, -24(%rbp)
        movq    $0, -8(%rbp)
        movl    -20(%rbp), %eax
        movl    %eax, -12(%rbp)
        jmp .L11
    .L12:
        movl    -12(%rbp), %eax
        cltq
        addq    %rax, -8(%rbp)
        addl    $1, -12(%rbp)
    .L11:
        movl    -20(%rbp), %edx
        movl    -24(%rbp), %eax
        addl    %edx, %eax
        cmpl    -12(%rbp), %eax
        jg  .L12
        movq    -8(%rbp), %rax
        popq    %rbp
        ret
    

    It sure looks to me like B is doing more work.
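For reference, the listing above is consistent with C++ source along the following lines. This is a reconstruction from the assembly, not the original code: the parameter names (`init`, `end`, `todo`), the local names, and the `long long` return type are assumptions, since the listing only shows stack slots and a 64-bit return value.

```cpp
// Sketch reconstructed from the unoptimized assembly; names are assumed.
long long superCalculationA(int init, int end) {
    long long total = 0;               // -8(%rbp), zeroed with movq $0
    for (int i = init; i < end; i++)   // compares i directly against `end`
        total += i;                    // cltq sign-extends i before the addq
    return total;
}

long long superCalculationB(int init, int todo) {
    long long total = 0;
    // At -O0 the bound init + todo is reloaded and re-added on every
    // iteration (the movl/movl/addl sequence at .L11).
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}
```

The only difference between the two loops is the exit test, which matches the extra two loads and one add that B executes per iteration.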

    My test platform is a 2.9GHz Sandy Bridge EP processor (E5-2690) running Red Hat Enterprise 6 Update 3. My compiler is GCC 4.9.1 and produces the assembly above.

    To make sure Turbo Boost and related CPU-frequency-diddling technologies are not interfering with the measurement, I ran:

    pkill cpuspeed # if you have it running
    grep MHz /proc/cpuinfo # to see where you start
    modprobe acpi_cpufreq # if you do not have it loaded
    cd /sys/devices/system/cpu 
    for cpuN in cpu[0-9]* ; do
        echo userspace > $cpuN/cpufreq/scaling_governor
        echo 2000000 > $cpuN/cpufreq/scaling_setspeed
    done
    grep MHz /proc/cpuinfo # to see if it worked
    

    This pins the CPU frequency to 2.0 GHz and disables Turbo Boost.

    Jerry observed these two routines running faster or slower depending on the order in which he executed them. I could not reproduce that result. For me, superCalculationB consistently runs 25-30% faster than superCalculationA, regardless of the Turbo Boost or clock speed settings. That includes running them multiple times in arbitrary order. For example, at 2.0 GHz superCalculationA consistently takes a little over 4500 ms and superCalculationB consistently takes a little under 3600 ms.
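A minimal sketch of the kind of timing helper that could produce such measurements, assuming `std::chrono::steady_clock` is an acceptable timer. The helper name `timeMs` and its signature are hypothetical; this is not the harness used for the numbers above.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: calls `fn` `repeats` times and returns the last
// result together with the total elapsed wall-clock time in milliseconds.
template <typename Fn>
std::pair<long long, double> timeMs(Fn fn, int repeats) {
    using Clock = std::chrono::steady_clock;
    volatile long long sink = 0;  // volatile store keeps each call observable
    const auto start = Clock::now();
    for (int r = 0; r < repeats; ++r)
        sink = fn();
    const auto stop = Clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    const long long result = sink;
    return {result, elapsed.count()};
}
```

To compare the two routines, one would pass a lambda that calls superCalculationA and another that calls superCalculationB with equivalent arguments, run them in both orders, and build without optimization flags to match the case discussed here.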

    I have yet to see any theory that even begins to explain this.
