I chose David's answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what happens when optimization flags are enabled.
(This is not exactly an answer, but it does include more data, including some that conflicts with Jerry Coffin's.)
The interesting question is why the unoptimized routines perform so differently and counter-intuitively. The `-O2` and `-O3` cases are relatively simple to explain, and others have done so.
For completeness, here is the assembly (thanks @Rutan Kax) for `superCalculationA` and `superCalculationB` produced by GCC 4.9.1:
superCalculationA(int, int):
    pushq %rbp
    movq %rsp, %rbp
    movl %edi, -20(%rbp)      # first argument
    movl %esi, -24(%rbp)      # second argument (loop bound)
    movq $0, -8(%rbp)         # total = 0
    movl -20(%rbp), %eax
    movl %eax, -12(%rbp)      # i = first argument
    jmp .L7
.L8:
    movl -12(%rbp), %eax
    cltq
    addq %rax, -8(%rbp)       # total += i
    addl $1, -12(%rbp)        # i++
.L7:
    movl -12(%rbp), %eax
    cmpl -24(%rbp), %eax      # compare i with the precomputed bound
    jl .L8
    movq -8(%rbp), %rax
    popq %rbp
    ret
superCalculationB(int, int):
    pushq %rbp
    movq %rsp, %rbp
    movl %edi, -20(%rbp)      # first argument
    movl %esi, -24(%rbp)      # second argument
    movq $0, -8(%rbp)         # total = 0
    movl -20(%rbp), %eax
    movl %eax, -12(%rbp)      # i = first argument
    jmp .L11
.L12:
    movl -12(%rbp), %eax
    cltq
    addq %rax, -8(%rbp)       # total += i
    addl $1, -12(%rbp)        # i++
.L11:
    movl -20(%rbp), %edx
    movl -24(%rbp), %eax
    addl %edx, %eax           # bound = arg1 + arg2, recomputed every iteration
    cmpl -12(%rbp), %eax      # compare the bound with i
    jg .L12
    movq -8(%rbp), %rax
    popq %rbp
    ret
It sure looks to me like B is doing more work.
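For reference, loops of roughly this shape produce the assembly above. This is a sketch, not the question's exact source; the parameter names `init`, `end`, and `todo` are my own labels for the two arguments:

```cpp
#include <cstdint>

// Sketch of the two routines. A receives its end bound precomputed,
// so the loop condition is a single compare. B writes the bound as
// init + todo in the condition, and at -O0 GCC re-evaluates that sum
// on every iteration, which is the extra work visible in its assembly.
int64_t superCalculationA(int init, int end) {
    int64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

int64_t superCalculationB(int init, int todo) {
    int64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}
```

With any optimization enabled, the compiler hoists `init + todo` out of the loop and the two routines compile to essentially the same code; the difference only shows up unoptimized.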
My test platform is a 2.9 GHz Sandy Bridge EP processor (E5-2690) running Red Hat Enterprise Linux 6 Update 3. My compiler is GCC 4.9.1 and produces the assembly above.
To make sure Turbo Boost and related CPU-frequency-diddling technologies are not interfering with the measurement, I ran:
pkill cpuspeed # if you have it running
grep MHz /proc/cpuinfo # to see where you start
modprobe acpi_cpufreq # if you do not have it loaded
cd /sys/devices/system/cpu
for cpuN in cpu[0-9]* ; do
    echo userspace > $cpuN/cpufreq/scaling_governor
    echo 2000000 > $cpuN/cpufreq/scaling_setspeed
done
grep MHz /proc/cpuinfo # to see if it worked
This pins the CPU frequency to 2.0 GHz and disables Turbo Boost.
Jerry observed these two routines running faster or slower depending on the order in which he executed them. I could not reproduce that result. For me, `superCalculationB` consistently runs 25-30% faster than `superCalculationA`, regardless of the Turbo Boost or clock speed settings. That includes running them multiple times in arbitrary order. For example, at 2.0 GHz `superCalculationA` consistently takes a little over 4500ms and `superCalculationB` consistently takes a little under 3600ms.
I have yet to see any theory that even begins to explain this.