I want to time a function call with rdtsc. So I measured it in two ways as follows.
Have you tried clock_gettime(CLOCK_MONOTONIC, &tp)? It should come quite close to reading the cycle counter by hand. Also keep in mind that the cycle counter may not be synchronized between CPU cores.
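As a rough sketch of the clock_gettime suggestion, assuming a POSIX system such as Linux. The helper names monotonic_ns and time_call_ns and the empty bar() are made up for illustration; they are not from the question.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime under strict -std=c99 */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the function being measured. */
static void bar(void) {}

/* Read CLOCK_MONOTONIC and return it as a nanosecond count. */
static uint64_t monotonic_ns(void)
{
    struct timespec tp;
    int rc = clock_gettime(CLOCK_MONOTONIC, &tp);
    assert(rc == 0);   /* assumption: the clock is available on this system */
    (void)rc;          /* silence the unused warning when NDEBUG is set */
    return (uint64_t)tp.tv_sec * 1000000000ull + (uint64_t)tp.tv_nsec;
}

/* Nanoseconds spent in one call of fn, including the cost of the two clock reads. */
static uint64_t time_call_ns(void (*fn)(void))
{
    uint64_t start = monotonic_ns();
    fn();
    uint64_t end = monotonic_ns();
    return end - start;
}
```

Calling time_call_ns(bar) returns the elapsed nanoseconds for one call. On older glibc versions you may need to link with -lrt.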
You use the plain rdtsc instruction, which may not work correctly on out-of-order CPUs such as Xeons and Cores. You should add a serializing instruction or switch to the rdtscp instruction:
http://en.wikipedia.org/wiki/Time_Stamp_Counter
Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.[3] This problem can be solved by executing a serializing instruction, such as CPUID, to force every preceding instruction to complete before allowing the program to continue, or by using the RDTSCP instruction, which is a serializing variant of the RDTSC instruction.
Intel has a recent manual on using rdtsc/rdtscp: How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures (ia-32-ia-64-benchmark-code-execution-paper.pdf, 324264-001, 2010). They recommend cpuid+rdtsc for the start timer and rdtscp+cpuid for the end timer:
The solution to the problem presented in Section 0 is to add a CPUID instruction just after the RDTSCP and the two mov instructions (to store in memory the value of edx and eax). The implementation is as follows:
asm volatile ("CPUID\n\t"
"RDTSC\n\t"
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
"%rax", "%rbx", "%rcx", "%rdx");
/***********************************/
/*call the function to measure here*/
/***********************************/
asm volatile("RDTSCP\n\t"
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t"
"CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1)::
"%rax", "%rbx", "%rcx", "%rdx");
start = ( ((uint64_t)cycles_high << 32) | cycles_low );
end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 );
In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. Nevertheless, this call does not affect the measurement since it comes before the RDTSC (i.e., before the timestamp register is read). The first RDTSC then reads the timestamp register and the value is stored in memory. Then the code that we want to measure is executed. If the code is a call to a function, it is recommended to declare such function as “inline” so that from an assembly perspective there is no overhead in calling the function itself. The RDTSCP instruction reads the timestamp register for the second time and guarantees that the execution of all the code we wanted to measure is completed.
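The same start/stop pattern can also be written with compiler intrinsics instead of inline asm. This is only a sketch under several assumptions: an x86-64 target, GCC or Clang with <x86intrin.h>, and LFENCE swapped in for CPUID as the barrier. The Intel SDM describes placing LFENCE around RDTSC/RDTSCP for ordering on Intel CPUs, but that is not what the paper quoted above uses. The names tsc_begin, tsc_end, time_call_ticks and bar are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86 only) */

/* Hypothetical stand-in for the function being measured. */
static void bar(void) {}

/* Start timestamp: the fences keep earlier and later instructions
   from being reordered across the RDTSC. */
static inline uint64_t tsc_begin(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Stop timestamp: RDTSCP waits for preceding instructions to execute,
   and the LFENCE holds back the ones that follow. */
static inline uint64_t tsc_end(void)
{
    unsigned int aux;              /* receives IA32_TSC_AUX, unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* TSC ticks spent in one call of fn, including the timer's own overhead. */
static uint64_t time_call_ticks(void (*fn)(void))
{
    uint64_t start = tsc_begin();
    fn();
    uint64_t end = tsc_end();
    /* Sanity check: fails if the counter appears to run backwards,
       e.g. after migrating between cores with unsynchronized TSCs. */
    assert(end >= start);
    return end - start;
}
```

Note that the result is in TSC ticks, which on current CPUs advance at a constant reference rate and are not necessarily core clock cycles.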
Your example is not quite correct: you try to measure the empty function bar(), but it is so short that in method 1 (for() { rdtsc; bar(); rdtsc }) you are really measuring the rdtsc overhead. According to Agner Fog's table for Haswell (http://www.agner.org/optimize/instruction_tables.pdf, page 191, the long table "Intel Haswell List of instruction timings and μop breakdown", at the very end of it), RDTSC has 15 uops (no fusion possible) and a latency of 24 ticks; RDTSCP (listed for the older Sandy Bridge microarchitecture) has 23 uops and 36 ticks latency, versus 21 uops and 28 ticks for rdtsc. So you can't use plain rdtsc (or rdtscp) to directly measure such short code.
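One common workaround, sketched here under assumptions (x86-64, GCC or Clang, <x86intrin.h>), is to estimate the timer's own cost from back-to-back reads and to time a whole batch of calls so that cost is paid only once. The names tsc_overhead, ticks_per_call and the noinline bar() are made up for illustration; this is not the code from the question.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Hypothetical stand-in for the short function under test; noinline and
   the empty asm keep the compiler from deleting the calls. */
__attribute__((noinline)) static void bar(void)
{
    __asm__ volatile("" ::: "memory");
}

/* Smallest difference between two back-to-back TSC reads,
   i.e. an estimate of the timer's own cost in ticks. */
static uint64_t tsc_overhead(int trials)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        _mm_lfence();
        uint64_t t0 = __rdtsc();
        _mm_lfence();
        uint64_t t1 = __rdtsc();
        _mm_lfence();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Average TSC ticks per call over a batch; loop overhead is still included. */
static double ticks_per_call(void (*fn)(void), uint64_t calls, uint64_t overhead)
{
    assert(calls > 0);
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    for (uint64_t i = 0; i < calls; i++)
        fn();
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    _mm_lfence();
    uint64_t total = t1 - t0;
    total = (total > overhead) ? total - overhead : 0;   /* subtract timer cost once */
    return (double)total / (double)calls;
}
```

For example, ticks_per_call(bar, 1000000, tsc_overhead(1000)) gives an average per call. It says nothing about the latency of any single call, and the figure is in TSC ticks rather than core cycles.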