Difficulties measuring C/C++ performance

执笔经年 2021-02-19 18:07

I wrote a piece of C code to illustrate a point in a discussion about optimizations and branch prediction. Then I noticed an even more diverse outcome than I had expected. My goal was to w

3 Answers
  • 2021-02-19 18:26

    With -O1, gcc-4.7.1 calls unpredictableIfs only once and reuses the result, since it recognizes that it is a pure function whose result is the same on every call. (Mine did; I verified this by looking at the generated assembly.)

    At higher optimisation levels, the functions are inlined, and the compiler no longer recognizes that it is the same code, so it is run each time a function call appears in the source.

    Apart from that, my gcc-4.7.1 deals best with unpredictableIfs when using -O1 or -O2 (apart from the reuse issue, both produce the same code), while noIfs is treated much better with -O3. The timings between the different runs of the same code are however consistent here - equal or differing by 10 milliseconds (granularity of clock), so I have no idea what could cause the substantially different times for unpredictableIfs you reported for -O3.
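
    For reference, the two functions under discussion can be roughly reconstructed from the assembly listings in this answer. This is only a sketch: the original source is not shown here, so the function bodies are an assumption, and the names `MASK`, `count_with_if` and `count_without_if`, as well as the `limit` parameter (added so the loops can be tried with small counts), are hypothetical. Only the constants 1073741826 (0x40000002) and 1000000000 come from the listings.

```c
/* Mask taken from the assembly listings: 1073741826 == 0x40000002. */
#define MASK 0x40000002

/* Branchy variant (assumed shape of unpredictableIfs):
   counts the i in [0, limit) that have none of the MASK bits set. */
int count_with_if(int limit)
{
    int sum = 0;
    for (int i = 0; i < limit; ++i) {
        if ((i & MASK) == 0) {
            sum++;
        }
    }
    return sum;
}

/* Branch-free variant (assumed shape of noIfs):
   adds the 0/1 result of the comparison directly. */
int count_without_if(int limit)
{
    int sum = 0;
    for (int i = 0; i < limit; ++i) {
        sum += (i & MASK) == 0;
    }
    return sum;
}
```

    With a limit of 1000000000, bit 30 is never set and bit 1 is clear for exactly half of the values, so both variants should return 500000000.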

    With -O2, the loop for unpredictableIfs is identical to the code generated with -O1 (except for register swapping):

    .L12:
        movl    %eax, %ecx
        andl    $1073741826, %ecx
        cmpl    $1, %ecx
        adcl    $0, %edx
        addl    $1, %eax
        cmpl    $1000000000, %eax
        jne .L12
    

    and for noIfs it's similar:

    .L15:
        xorl    %ecx, %ecx
        testl   $1073741826, %eax
        sete    %cl
        addl    $1, %eax
        addl    %ecx, %edx
        cmpl    $1000000000, %eax
        jne .L15
    

    where it was

    .L7:
        testl   $1073741826, %edx
        sete    %cl
        movzbl  %cl, %ecx
        addl    %ecx, %eax
        addl    $1, %edx
        cmpl    $1000000000, %edx
        jne .L7
    

    with -O1. Both loops run in similar time, with unpredictableIfs a bit faster.
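
    Neither of these loops contains a conditional jump, even for the version written with an if. The following C sketch models one iteration of each, as I read the listings: `cmpl $1, %ecx` sets the carry flag exactly when the masked value is 0 (it is the only unsigned value below 1), and `adcl $0, %edx` then adds that carry to the sum. The helper names `step_adc` and `step_sete` are hypothetical.

```c
/* Models `andl $MASK; cmpl $1; adcl $0`: the carry flag after
   `cmp $1, x` is set exactly when x < 1 as an unsigned value,
   which means x == 0. */
unsigned step_adc(unsigned sum, unsigned i)
{
    unsigned masked = i & 0x40000002u;
    unsigned carry = masked < 1u;   /* what cmpl $1, %ecx leaves in CF */
    return sum + carry;             /* what adcl $0, %edx adds */
}

/* Models `xorl; testl $MASK; sete; addl`: the zero flag is
   materialised as a 0/1 value and added to the sum. */
unsigned step_sete(unsigned sum, unsigned i)
{
    unsigned is_zero = (i & 0x40000002u) == 0;  /* sete %cl */
    return sum + is_zero;                       /* addl %ecx, %edx */
}
```

    Both helpers compute the same result for every input; they differ only in which flag the compiler chose to turn into a number.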

    With -O3, the loop for unpredictableIfs becomes worse,

    .L14:
        leal    1(%rdx), %ecx
        testl   $1073741826, %eax
        cmove   %ecx, %edx
        addl    $1, %eax
        cmpl    $1000000000, %eax
        jne     .L14
    

    and for noIfs (including the setup-code here), it becomes better:

        pxor    %xmm2, %xmm2
        movq    %rax, 32(%rsp)
        movdqa  .LC3(%rip), %xmm6
        xorl    %eax, %eax
        movdqa  .LC2(%rip), %xmm1
        movdqa  %xmm2, %xmm3
        movdqa  .LC4(%rip), %xmm5
        movdqa  .LC5(%rip), %xmm4
        .p2align 4,,10
        .p2align 3
    .L18:
        movdqa  %xmm1, %xmm0
        addl    $1, %eax
        paddd   %xmm6, %xmm1
        cmpl    $250000000, %eax
        pand    %xmm5, %xmm0
        pcmpeqd %xmm3, %xmm0
        pand    %xmm4, %xmm0
        paddd   %xmm0, %xmm2
        jne .L18
    
    .LC2:
        .long   0
        .long   1
        .long   2
        .long   3
        .align 16
    .LC3:
        .long   4
        .long   4
        .long   4
        .long   4
        .align 16
    .LC4:
        .long   1073741826
        .long   1073741826
        .long   1073741826
        .long   1073741826
        .align 16
    .LC5:
        .long   1
        .long   1
        .long   1
        .long   1
    

    It computes four iterations at once, and accordingly noIfs runs almost four times as fast.
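
    To make the vectorised loop easier to follow, here is a plain C model of it, with a four-element array standing in for each SSE register. It is a sketch of how I read the listing, not compiler output: the function name `count_four_lanes` and the `limit` parameter are hypothetical, `limit` is assumed to be a multiple of 4, and the final horizontal sum is assumed to happen after the loop, since that part is not in the listing.

```c
/* Scalar model of the -O3 SSE2 loop: four 32-bit lanes advance together.
   `limit` must be a multiple of 4 in this sketch. */
int count_four_lanes(int limit)
{
    int index[4] = {0, 1, 2, 3};    /* .LC2: initial lane indices   */
    int partial[4] = {0, 0, 0, 0};  /* %xmm2: per-lane partial sums */

    for (int iter = 0; iter < limit / 4; ++iter) {
        for (int lane = 0; lane < 4; ++lane) {
            int masked = index[lane] & 0x40000002;   /* pand with .LC4      */
            int all_ones = (masked == 0) ? -1 : 0;   /* pcmpeqd with zero   */
            partial[lane] += all_ones & 1;           /* pand .LC5, paddd    */
            index[lane] += 4;                        /* paddd with .LC3     */
        }
    }

    /* Combine the four partial sums (assumed to follow the loop). */
    return partial[0] + partial[1] + partial[2] + partial[3];
}
```

    The loop counter in the listing runs to 250000000 rather than 1000000000 for the same reason the outer loop here runs `limit / 4` times.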

  • 2021-02-19 18:33

    Regarding the range of results on Windows (from 1016 ms to 4797 ms): You should know that clock() in MSVC returns elapsed wall time. The standard says clock() should return an approximation of CPU time spent by the process, and other implementations do a better job of this.

    Given that MSVC is giving wall time, if your process got pre-empted while running one iteration of the test, it could give a much larger result, even if the code ran in approximately the same amount of CPU time.

    Also note that clock() on many Windows PCs has a pretty lousy resolution, often around 11-19 ms. You've done enough iterations that this is only about 1%, so I don't think it's part of the discrepancy, but it's good to be aware of when trying to write a benchmark. I understand you're going for portability, but if you need a better measurement on Windows, you can use QueryPerformanceCounter, which will almost certainly give you much better resolution, though it still measures elapsed wall time.
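
    One way to see the difference between CPU time and wall time is to record both around the same call. The sketch below is portable C11 rather than Windows-specific: it uses clock() for the CPU-time reading and timespec_get() for the wall-clock reading, in place of QueryPerformanceCounter. The helper names `cpu_ms`, `wall_ms`, `time_call` and `sample_work` are hypothetical, and `sample_work` is only a stand-in workload.

```c
#include <time.h>

/* Time consumed so far according to clock(), in milliseconds.
   The standard describes this as processor time; with MSVC it
   tracks wall time instead, as noted above. */
double cpu_ms(void)
{
    return 1000.0 * (double)clock() / CLOCKS_PER_SEC;
}

/* Wall-clock time in milliseconds via C11 timespec_get(). */
double wall_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return 1000.0 * (double)ts.tv_sec + (double)ts.tv_nsec / 1.0e6;
}

/* Stand-in workload: counts i in [0, limit) with bits 1 and 30 clear. */
int sample_work(int limit)
{
    int sum = 0;
    for (int i = 0; i < limit; ++i) {
        sum += (i & 0x40000002) == 0;
    }
    return sum;
}

/* Runs fn(arg) once and reports both elapsed times in milliseconds. */
int time_call(int (*fn)(int), int arg, double *cpu_elapsed, double *wall_elapsed)
{
    double cpu_start = cpu_ms();
    double wall_start = wall_ms();
    int result = fn(arg);
    *cpu_elapsed = cpu_ms() - cpu_start;
    *wall_elapsed = wall_ms() - wall_start;
    return result;
}
```

    If the process is pre-empted during the call, the wall-clock figure grows while the CPU-time figure should stay roughly the same on implementations where clock() really reports CPU time.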

    UPDATE: After I learned that the long runtime on the one case was happening consistently, I fired up VS2010 and reproduced the results. I was typically getting something around 1000 ms for some runs, 750 ms for others, and 5000+ ms for the inexplicable ones.

    Observations:

    1. In all cases the unpredictableIfs() code was inlined.
    2. Removing the noIfs() code had no impact (so the long time wasn't a side effect of that code).
    3. Setting thread affinity to a single processor had no effect.
    4. The 5000 ms times were invariably the later instances. I noted that the later instances had an extra instruction before the beginning of the loop: lea ecx,[ecx]. I don't see why that should make a 5x difference. Other than that the early and later instances were identical code.
    5. Removing the volatile from the start and stop variables yielded fewer long runs, more of the 750 ms runs, and no 1000 ms runs. (The generated loop code looks exactly the same in all cases now, with no lea instructions.)
    6. Removing the volatile from the sum variable (but keeping it for the clock timers), the long runs can happen in any position.
    7. If you remove all of the volatile qualifiers, you get consistent, fast (750 ms) runs. (The code looks identical to the earlier ones, but edi was chosen for sum instead of ecx.)

    I'm not sure what to conclude from all this, except that volatile has unpredictable performance consequences with MSVC, so you should apply it only when necessary.

    UPDATE 2: I'm seeing consistent runtime differences tied to the use of volatile, even though the disassembly is almost identical.

    With volatile:

    Puzzling measurements:
    Unpredictable ifs took 643 msec; answer was 500000000
    Unpredictable ifs took 1248 msec; answer was 500000000
    Unpredictable ifs took 605 msec; answer was 500000000
    Unpredictable ifs took 4611 msec; answer was 500000000
    Unpredictable ifs took 4706 msec; answer was 500000000
    Unpredictable ifs took 4516 msec; answer was 500000000
    Unpredictable ifs took 4382 msec; answer was 500000000
    

    The disassembly for each instance looks like this:

        start = clock();
    010D1015  mov         esi,dword ptr [__imp__clock (10D20A0h)]  
    010D101B  add         esp,4  
    010D101E  call        esi  
    010D1020  mov         dword ptr [start],eax  
        sum = unpredictableIfs();
    010D1023  xor         ecx,ecx  
    010D1025  xor         eax,eax  
    010D1027  test        eax,40000002h  
    010D102C  jne         main+2Fh (10D102Fh)  
    010D102E  inc         ecx  
    010D102F  inc         eax  
    010D1030  cmp         eax,3B9ACA00h  
    010D1035  jl          main+27h (10D1027h)  
    010D1037  mov         dword ptr [sum],ecx  
        stop = clock();
    010D103A  call        esi  
    010D103C  mov         dword ptr [stop],eax  
    

    Without volatile:

    Puzzling measurements:
    Unpredictable ifs took 644 msec; answer was 500000000
    Unpredictable ifs took 624 msec; answer was 500000000
    Unpredictable ifs took 624 msec; answer was 500000000
    Unpredictable ifs took 605 msec; answer was 500000000
    Unpredictable ifs took 599 msec; answer was 500000000
    Unpredictable ifs took 599 msec; answer was 500000000
    Unpredictable ifs took 599 msec; answer was 500000000
    
        start = clock();
    00321014  mov         esi,dword ptr [__imp__clock (3220A0h)]  
    0032101A  add         esp,4  
    0032101D  call        esi  
    0032101F  mov         dword ptr [start],eax  
        sum = unpredictableIfs();
    00321022  xor         ebx,ebx  
    00321024  xor         eax,eax  
    00321026  test        eax,40000002h  
    0032102B  jne         main+2Eh (32102Eh)  
    0032102D  inc         ebx  
    0032102E  inc         eax  
    0032102F  cmp         eax,3B9ACA00h  
    00321034  jl          main+26h (321026h)  
        stop = clock();
    00321036  call        esi
    // The only optimization I see is here, where eax isn't explicitly stored
    // in stop but is instead immediately used to compute the value for the
    // printf that follows.
    

    Other than register selection, I don't see a significant difference.

  • 2021-02-19 18:34

    Right: looking at the assembler code from gcc on 64-bit Linux, in the first case, with -O1, the function UnpredictableIfs is indeed called only once and the result reused.

    With -O2 and -O3, the functions are inlined, and the time they take should be identical. There are also no actual branches in either piece of code, but the translation of the two is somewhat different. I've cut out the lines that update "sum" [in %edx in both cases]:

    UnpredictableIfs:

    movl    %eax, %ecx
    andl    $1073741826, %ecx
    cmpl    $1, %ecx
    adcl    $0, %edx
    addl    $1, %eax
    

    NoIfs:

    xorl    %ecx, %ecx
    testl   $1073741826, %eax
    sete    %cl
    addl    $1, %eax
    addl    %ecx, %edx
    

    As you can see, it's not quite identical, but it does very similar things.
