rdtsc timing for a measuring a function

前端 未结 2 928
悲哀的现实
悲哀的现实 2021-01-17 06:19

I want to time a function call with rdtsc. So I measured it in two ways as follows.

  1. Call it in a loop. Aggregate each rdtsc difference within the loop and divi
相关标签:
2条回答
  • 2021-01-17 06:53

    Have you tried clock_gettime(CLOCK_MONOTONIC, &tp)? Should be quite near to reading the cycle counter by hand, also keep in mind that the cycle counter may not be synchronized between cpu cores.

    0 讨论(0)
  • 2021-01-17 07:02

    You use plain rdtsc instruction, which may not work correctly on Out-of-order CPUs, like Xeons and Cores. You should add some serializing instruction or switch to rdtscp instruction:

    http://en.wikipedia.org/wiki/Time_Stamp_Counter

    Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.[3] This problem can be solved by executing a serializing instruction, such as CPUID, to force every preceding instruction to complete before allowing the program to continue, or by using the RDTSCP instruction, which is a serializing variant of the RDTSC instruction.

    Intel has recent manual of using rdtsc/rdtscp - How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures (ia-32-ia-64-benchmark-code-execution-paper.pdf, 324264-001, 2010). They recommend cpuid+rdtsc for start and rdtscp for end timers:

    The solution to the problem presented in Section 0 is to add a CPUID instruction just after the RDTPSCP and the two mov instructions (to store in memory the value of edx and eax). The implementation is as follows:

    asm volatile ("CPUID\n\t"
     "RDTSC\n\t"
     "mov %%edx, %0\n\t"
     "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
    "%rax", "%rbx", "%rcx", "%rdx");
    /***********************************/
    /*call the function to measure here*/
    /***********************************/
    asm volatile("RDTSCP\n\t"
     "mov %%edx, %0\n\t"
     "mov %%eax, %1\n\t"
     "CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1)::
    "%rax", "%rbx", "%rcx", "%rdx");
    
    start = ( ((uint64_t)cycles_high << 32) | cycles_low );
    end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 );
    

    In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. Nevertheless, this call does not affect the measurement since it comes before the RDTSC (i.e., before the timestamp register is read). The first RDTSC then reads the timestamp register and the value is stored in memory. Then the code that we want to measure is executed. If the code is a call to a function, it is recommended to declare such function as “inline” so that from an assembly perspective there is no overhead in calling the function itself. The RDTSCP instruction reads the timestamp register for the second time and guarantees that the execution of all the code we wanted to measure is completed.

    You example is not very correct; you try to measure empty function bar(), but it is so short that you are measuring rdtsc overhead in method 1 (for() { rdtsc; bar(); rdtsc)). According to the Agner Fog's table for haswell - http://www.agner.org/optimize/instruction_tables.pdf page 191 (long table "Intel Haswell List of instruction timings and μop breakdown", at the very end of it) RDTSC has 15 uops (no fusion possible) and the latency of 24 ticks; RDTSCP (for older microarchitecture Sandy Bridge has 23 uops and 36 ticks latency versus 21 uops and 28 ticks for rdtsc). So, you can't use plain rdtsc (or rdtscp) to directly measure such short code.

    0 讨论(0)
提交回复
热议问题