I wrote a piece of C code to show a point in a discussion about optimizations and branch prediction. Then I noticed even more diverse outcome than I did expect. My goal was to w
Regarding the range of results on Windows (from 1016 ms to 4797 ms): You should know that clock()
in MSVC returns elapsed wall time. The standard says clock()
should return an approximation of CPU time spent by the process, and other implementations do a better job of this.
Given that MSVC is giving wall time, if your process got pre-empted while running one iteration of the test, it could give a much larger result, even if the code ran in approximately the same amount of CPU time.
Also note that clock()
on many Windows PCs has a pretty lousy resolution, often like 11-19 ms. You've done enough iterations that that's only about 1%, so I don't think it's part of the discrepancy, but it's good to be aware of when trying to write a benchmark. I understand you're going for portability, but if you needed a better measurement on Windows, you can use QueryPerformanceCounter which will almost certainly give you much better resolution, though it's still just elapsed wall time.
UPDATE: After I learned that the long runtime on the one case was happening consistently, I fired up VS2010 and reproduced the results. I was typically getting something around 1000 ms for some runs, 750 ms for others, and 5000+ ms for the inexplicable ones.
Observations:
lea ecx,[ecx]
. I don't see why that should make a 5x difference. Other than that the early and later instances were identical code.volatile
from the start
and stop
variables yielded fewer long runs, more of the 750 ms runs, and no 1000 ms runs. (The generated loop code looks exactly the same in all cases now, not lea
s.)volatile
from the sum
variable (but keeping it for the clock timers), the long runs can happen in any position.volatile
qualifiers, you get consistent, fast (750 ms) runs. (The code looks identical to the earlier ones, but edi
was chosen for sum
instead of ecx
.)I'm not sure what to conclude from all this, except that volatile
has unpredictable performance consequences with MSVC, so you should apply it only when necessary.
UPDATE 2: I'm seeing consistent runtime differences tied to the use of volatile, even though the disassembly is almost identical.
With volatile:
Puzzling measurements:
Unpredictable ifs took 643 msec; answer was 500000000
Unpredictable ifs took 1248 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 4611 msec; answer was 500000000
Unpredictable ifs took 4706 msec; answer was 500000000
Unpredictable ifs took 4516 msec; answer was 500000000
Unpredictable ifs took 4382 msec; answer was 500000000
The disassembly for each instance looks like this:
start = clock();
010D1015 mov esi,dword ptr [__imp__clock (10D20A0h)]
010D101B add esp,4
010D101E call esi
010D1020 mov dword ptr [start],eax
sum = unpredictableIfs();
010D1023 xor ecx,ecx
010D1025 xor eax,eax
010D1027 test eax,40000002h
010D102C jne main+2Fh (10D102Fh)
010D102E inc ecx
010D102F inc eax
010D1030 cmp eax,3B9ACA00h
010D1035 jl main+27h (10D1027h)
010D1037 mov dword ptr [sum],ecx
stop = clock();
010D103A call esi
010D103C mov dword ptr [stop],eax
Without volatile:
Puzzling measurements:
Unpredictable ifs took 644 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
start = clock();
00321014 mov esi,dword ptr [__imp__clock (3220A0h)]
0032101A add esp,4
0032101D call esi
0032101F mov dword ptr [start],eax
sum = unpredictableIfs();
00321022 xor ebx,ebx
00321024 xor eax,eax
00321026 test eax,40000002h
0032102B jne main+2Eh (32102Eh)
0032102D inc ebx
0032102E inc eax
0032102F cmp eax,3B9ACA00h
00321034 jl main+26h (321026h)
stop = clock();
00321036 call esi
// The only optimization I see is here, where eax isn't explicitly stored
// in stop but is instead immediately used to compute the value for the
// printf that follows.
Other than register selection, I don't see a significant difference.