Difficulties measuring C/C++ performance

执笔经年 2021-02-19 18:07

I wrote a piece of C code to illustrate a point in a discussion about optimizations and branch prediction. Then I noticed that the results varied even more than I had expected. My goal was to w

3 answers
  •  庸人自扰
    2021-02-19 18:33

    Regarding the range of results on Windows (from 1016 ms to 4797 ms): You should know that clock() in MSVC returns elapsed wall time. The standard says clock() should return an approximation of CPU time spent by the process, and other implementations do a better job of this.

    Given that MSVC is giving wall time, if your process got pre-empted while running one iteration of the test, it could give a much larger result, even if the code ran in approximately the same amount of CPU time.
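
    One common way to limit the effect of pre-emption on a wall-time measurement is to repeat the timed region and keep the smallest interval, since the fastest run is the one least disturbed by the scheduler. This is a swapped-in technique, not something the question's code does. The sketch below assumes hypothetical names (min_elapsed_ms, busy_work) and a stand-in workload; only clock() and CLOCKS_PER_SEC come from the standard library.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: run `work` `reps` times and return the smallest
   clock() interval in milliseconds. The last value returned by `work`
   is stored in *result so the call cannot be discarded as dead code. */
static double min_elapsed_ms(unsigned long (*work)(void), int reps,
                             unsigned long *result)
{
    assert(work != NULL && result != NULL && reps > 0);
    double best = -1.0;
    for (int i = 0; i < reps; ++i) {
        clock_t start = clock();
        *result = work();
        clock_t stop = clock();
        double ms = 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || ms < best)
            best = ms;   /* keep the least-disturbed run */
    }
    return best;
}

/* Stand-in workload for illustration; not the code from the question. */
static unsigned long busy_work(void)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 1000000UL; ++i)
        acc += i & 3u;
    return acc;
}
```

    Calling min_elapsed_ms(busy_work, 5, &result) times five runs and reports the quickest one. It would not hide a slowdown that occurs on every run, such as the consistent 5000 ms case described further down.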

    Also note that clock() on many Windows PCs has fairly coarse resolution, often around 11-19 ms. You've done enough iterations that this is only about 1% of the measurement, so I don't think it explains the discrepancy, but it's good to be aware of when writing a benchmark. I understand you're going for portability, but if you need a better measurement on Windows, you can use QueryPerformanceCounter, which will almost certainly give you much better resolution, though it still measures elapsed wall time.
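
    A minimal sketch of such a timer, assuming a hypothetical helper named wall_nanoseconds. On Windows it uses QueryPerformanceCounter and QueryPerformanceFrequency; elsewhere it falls back to C11 timespec_get, which is an assumption about the target C library and is not mentioned in the question. Both branches report elapsed wall time, not CPU time.

```c
#include <time.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: current wall-clock reading in nanoseconds.
   Only differences between two readings are meaningful. */
static long long wall_nanoseconds(void)
{
#ifdef _WIN32
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);  /* ticks per second */
    QueryPerformanceCounter(&count);   /* current tick count */
    /* Split into whole seconds and remainder to avoid overflowing
       a 64-bit integer when scaling ticks to nanoseconds. */
    return (long long)(count.QuadPart / freq.QuadPart) * 1000000000LL
         + (long long)(count.QuadPart % freq.QuadPart) * 1000000000LL
               / freq.QuadPart;
#else
    /* Assumption: the C library provides C11 timespec_get. */
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (long long)ts.tv_sec * 1000000000LL + (long long)ts.tv_nsec;
#endif
}
```

    To time a region, take one reading before and one after, then subtract; dividing the difference by 1000000 gives milliseconds comparable to the clock()-based figures.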

    UPDATE: After I learned that the long runtime on the one case was happening consistently, I fired up VS2010 and reproduced the results. I was typically getting something around 1000 ms for some runs, 750 ms for others, and 5000+ ms for the inexplicable ones.

    Observations:

    1. In all cases the unpredictableIfs() code was inlined.
    2. Removing the noIfs() code had no impact (so the long time wasn't a side effect of that code).
    3. Setting thread affinity to a single processor had no effect.
    4. The 5000 ms times were invariably the later instances. I noted that the later instances had an extra instruction before the beginning of the loop: lea ecx,[ecx]. I don't see why that should make a 5x difference. Other than that the early and later instances were identical code.
    5. Removing the volatile from the start and stop variables yielded fewer long runs, more of the 750 ms runs, and no 1000 ms runs. (The generated loop code looks exactly the same in all cases now, with no leas.)
    6. Removing the volatile from the sum variable (but keeping it for the clock timers), the long runs can happen in any position.
    7. If you remove all of the volatile qualifiers, you get consistent, fast (750 ms) runs. (The code looks identical to the earlier ones, but edi was chosen for sum instead of ecx.)

    I'm not sure what to conclude from all this, except that volatile has unpredictable performance consequences with MSVC, so you should apply it only when necessary.

    UPDATE 2: I'm seeing consistent runtime differences tied to the use of volatile, even though the disassembly is almost identical.

    With volatile:

    Puzzling measurements:
    Unpredictable ifs took 643 msec; answer was 500000000
    Unpredictable ifs took 1248 msec; answer was 500000000
    Unpredictable ifs took 605 msec; answer was 500000000
    Unpredictable ifs took 4611 msec; answer was 500000000
    Unpredictable ifs took 4706 msec; answer was 500000000
    Unpredictable ifs took 4516 msec; answer was 500000000
    Unpredictable ifs took 4382 msec; answer was 500000000
    

    The disassembly for each instance looks like this:

        start = clock();
    010D1015  mov         esi,dword ptr [__imp__clock (10D20A0h)]  
    010D101B  add         esp,4  
    010D101E  call        esi  
    010D1020  mov         dword ptr [start],eax  
        sum = unpredictableIfs();
    010D1023  xor         ecx,ecx  
    010D1025  xor         eax,eax  
    010D1027  test        eax,40000002h  
    010D102C  jne         main+2Fh (10D102Fh)  
    010D102E  inc         ecx  
    010D102F  inc         eax  
    010D1030  cmp         eax,3B9ACA00h  
    010D1035  jl          main+27h (10D1027h)  
    010D1037  mov         dword ptr [sum],ecx  
        stop = clock();
    010D103A  call        esi  
    010D103C  mov         dword ptr [stop],eax  
    

    Without volatile:

    Puzzling measurements:
    Unpredictable ifs took 644 msec; answer was 500000000
    Unpredictable ifs took 624 msec; answer was 500000000
    Unpredictable ifs took 624 msec; answer was 500000000
    Unpredictable ifs took 605 msec; answer was 500000000
    Unpredictable ifs took 599 msec; answer was 500000000
    Unpredictable ifs took 599 msec; answer was 500000000
    Unpredictable ifs took 599 msec; answer was 500000000
    
        start = clock();
    00321014  mov         esi,dword ptr [__imp__clock (3220A0h)]  
    0032101A  add         esp,4  
    0032101D  call        esi  
    0032101F  mov         dword ptr [start],eax  
        sum = unpredictableIfs();
    00321022  xor         ebx,ebx  
    00321024  xor         eax,eax  
    00321026  test        eax,40000002h  
    0032102B  jne         main+2Eh (32102Eh)  
    0032102D  inc         ebx  
    0032102E  inc         eax  
    0032102F  cmp         eax,3B9ACA00h  
    00321034  jl          main+26h (321026h)  
        stop = clock();
    00321036  call        esi
    // The only optimization I see is here, where eax isn't explicitly stored
    // in stop but is instead immediately used to compute the value for the
    // printf that follows.
    

    Other than register selection, I don't see a significant difference.
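
    For readers who prefer C to assembly, the inlined loop in both listings can be read back roughly as follows. This is a reconstruction from the disassembly, not the original source, which is not shown here. The limit parameter is added for illustration; the listings hard-code 3B9ACA00h (1,000,000,000).

```c
#include <assert.h>

/* Reconstructed from the disassembly above; comments name the
   instructions each line corresponds to. `limit` is hypothetical. */
static int unpredictableIfs(int limit)
{
    assert(limit >= 0);
    int sum = 0;                        /* xor ecx,ecx (ebx in listing 2) */
    for (int i = 0; i < limit; ++i) {   /* inc eax; cmp eax,limit; jl */
        if ((i & 0x40000002) == 0)      /* test eax,40000002h; jne skip */
            ++sum;                      /* inc ecx (or inc ebx) */
    }
    return sum;
}
```

    With limit set to 1000000000 this returns 500000000, matching the "answer was" value in both sets of measurements. Bit 30 is never set for values below 1000000000, so the test effectively depends on bit 1 of the counter alone.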
