RDTSCP in NASM always returns the same value (timing a single instruction)

Tags: backend · unresolved · 1 answer · 631 views
Asked by 抹茶落季 on 2020-11-30 09:39

I am using RDTSC and RDTSCP in NASM to measure machine cycles for various assembly language instructions to help in optimization.

I read "How to Benchmark Code Ex

1 Answer
  • Answered 2020-11-30 09:56

    Your first code (leading to the title question) is buggy because it overwrites the rdtsc and rdtscp results with the cpuid results in EAX,EBX,ECX and EDX.

    Use lfence instead of cpuid; on Intel since forever and AMD with Spectre mitigation enabled, lfence will serialize the instruction stream and thus do what you want with rdtsc.
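    The lfence + rdtsc pattern above can be sketched in C with compiler intrinsics (a minimal illustration, not from the original answer; assumes GCC or Clang on x86-64, where `__rdtsc` and `_mm_lfence` come from `x86intrin.h`):

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    /* Time a region with lfence-fenced rdtsc reads.  The loop body here is
     * just a placeholder workload; substitute whatever you want to time. */
    static uint64_t time_region(void) {
        _mm_lfence();                  /* wait for earlier instructions to finish */
        uint64_t start = __rdtsc();    /* reads reference cycles, not core cycles */
        _mm_lfence();                  /* keep the timed region after the read */

        volatile uint64_t sink = 0;    /* placeholder region under test */
        for (int i = 0; i < 1000; i++)
            sink += i;

        _mm_lfence();                  /* wait for the timed region to finish */
        uint64_t end = __rdtsc();
        return end - start;
    }

    int main(void) {
        printf("elapsed: %llu reference cycles\n",
               (unsigned long long)time_region());
        return 0;
    }
    ```

    Note the fences only order the instruction stream; they don't stop the measurement from including interrupt handlers or clock-frequency ramp-up.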


    Remember that RDTSC counts reference cycles, not core clock cycles. See "Get CPU cycle count?" for that and more about RDTSC.

    You don't have cpuid or lfence inside your measurement interval, but you do have rdtscp itself in the measurement interval. Back-to-back rdtscp is not fast: 64 reference cycles sounds totally reasonable if you ran without warming up the CPU. The idle clock speed is usually a lot slower than the reference frequency; one reference cycle is equal or close to the "sticker" frequency (i.e. the max non-turbo sustained frequency) on Intel CPUs, e.g. 4008 MHz on a "4GHz" Skylake CPU.


    This is not how you time a single instruction

    What matters is latency before another instruction can use the result, not latency until it fully retires from the out-of-order back-end. RDTSC can be useful for timing relative variations in how long one load or one store instruction takes, but the overhead means you won't get a good absolute time.

    You can try to subtract measurement overhead, though. e.g. "clflush to invalidate cache line via C function". And see also the followups: "Using time stamp counter and clock_gettime for cache miss" and "Memory latency measurement with time stamp counter".
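    Subtracting overhead can be sketched like this (an illustrative C example, not from the original answer: time an empty interval to estimate the fenced-rdtsc overhead, take the minimum of several runs to shed interrupt noise, then subtract):

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    /* One lfence-fenced TSC read, so back-to-back calls bracket a region. */
    static inline uint64_t fenced_tsc(void) {
        _mm_lfence();
        uint64_t t = __rdtsc();
        _mm_lfence();
        return t;
    }

    static uint64_t measure_empty(void) {   /* measurement overhead only */
        uint64_t start = fenced_tsc();
        uint64_t end   = fenced_tsc();
        return end - start;
    }

    static uint64_t measure_work(void) {    /* overhead + placeholder workload */
        uint64_t start = fenced_tsc();
        volatile uint64_t sink = 0;
        for (int i = 0; i < 10000; i++)
            sink += i;
        uint64_t end = fenced_tsc();
        return end - start;
    }

    int main(void) {
        /* Minimum over many runs filters out interrupts and ramp-up noise. */
        uint64_t overhead = (uint64_t)-1, work = (uint64_t)-1;
        for (int i = 0; i < 100; i++) {
            uint64_t e = measure_empty();  if (e < overhead) overhead = e;
            uint64_t w = measure_work();   if (w < work)     work = w;
        }
        printf("overhead ~%llu ref cycles, work minus overhead ~%llu ref cycles\n",
               (unsigned long long)overhead,
               (unsigned long long)(work > overhead ? work - overhead : 0));
        return 0;
    }
    ```

    Even with the subtraction, this gives reference cycles for one interval, not a clean per-instruction cost; it's most useful for relative comparisons like hit vs. miss on a single load.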


    This is what I usually use to profile latency or throughput (and uops, fused and unfused domain) of an instruction or short block. Adjust how you use it to bottleneck on latency like here, or not if you just want to test throughput: e.g. with a %rep block with enough different registers to hide latency, or break dependency chains with a pxor xmm3, xmm3 after a short block and let out-of-order exec work its magic. (As long as you don't bottleneck on the front-end.)

    You might want to use NASM's smartalign package (%use smartalign), or use YASM, to avoid a wall of single-byte NOP instructions for the ALIGN directive. NASM defaults to really stupid NOPs even in 64-bit mode where long-NOP is always supported.

    global _start
    _start:
        mov   ecx, 1000000000
    ; linux static executables start with XMM0..15 already zeroed
    align 32                     ; just for good measure to avoid uop-cache effects
    .loop:
        ;; LOOP BODY, put whatever you want to time in here
        times 4   addsd  xmm4, xmm3
    
        dec   ecx
        jnz   .loop
    
        mov  eax, 231
        xor  edi, edi
        syscall          ; x86-64 Linux sys_exit_group(0)
    

    Run this with something like this one-liner that links it into a static executable and profiles it with perf stat, which you can up-arrow and re-run every time you change the source:

    (I actually put the nasm+ld + optional disassemble into a shell script called asm-link, to save typing when I'm not profiling. Disassembling makes sure that what's in your loop is what you meant to profile, especially if you have some %if stuff in your code. And also so it's on your terminal right before the profile, if you want to scroll back while testing theories in your head.)

    t=testloop; nasm -felf64 -g "$t.asm" && ld "$t.o" -o "$t" &&  objdump -drwC -Mintel "$t" &&
     taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread -r4 ./"$t"
    

    Result from an i7-6700k at 3.9GHz (current perf has a unit-scaling display bug for the secondary column; it's fixed upstream but Arch Linux hasn't updated yet):

     Performance counter stats for './testloop' (4 runs):
    
              4,106.09 msec task-clock                #    1.000 CPUs utilized            ( +-  0.01% )
                    17      context-switches          #    4.080 M/sec                    ( +-  5.65% )
                     0      cpu-migrations            #    0.000 K/sec                  
                     2      page-faults               #    0.487 M/sec                  
        16,012,778,144      cycles                    # 3900323.504 GHz                   ( +-  0.01% )
         1,001,537,894      branches                  # 243950284.862 M/sec               ( +-  0.00% )
         6,008,071,198      instructions              #    0.38  insn per cycle           ( +-  0.00% )
         5,013,366,769      uops_issued.any           # 1221134275.667 M/sec              ( +-  0.01% )
         5,013,217,655      uops_executed.thread      # 1221097955.182 M/sec              ( +-  0.01% )
    
              4.106283 +- 0.000536 seconds time elapsed  ( +-  0.01% )
    

    On my i7-6700k (Skylake), addsd has 4 cycle latency, 0.5c throughput. (i.e. 2 per clock, if latency wasn't the bottleneck). See https://agner.org/optimize/, https://uops.info/, and http://instlatx64.atw.hu/.

    16 cycles per branch = 16 cycles per chain of 4 addsd = 4 cycle latency for addsd, reproducing Agner Fog's measurement of 4 cycles to better than 1 part in 100 even for this test that includes a tiny bit of startup overhead, and interrupt overhead.
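    That arithmetic can be checked directly from the perf counts above (a trivial illustration; the numbers are copied from the perf stat output):

    ```c
    #include <stdio.h>

    /* cycles / branches gives core cycles per loop iteration; each iteration
     * is a serial chain of 4 addsd, so dividing by 4 estimates the latency. */
    static double addsd_latency_estimate(void) {
        double cycles   = 16012778144.0;  /* cycles count from perf stat */
        double branches =  1001537894.0;  /* branches == loop iterations */
        double per_iter = cycles / branches;
        return per_iter / 4.0;
    }

    int main(void) {
        printf("~%.3f cycles per iteration, ~%.3f cycle addsd latency\n",
               addsd_latency_estimate() * 4.0, addsd_latency_estimate());
        return 0;
    }
    ```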

    Take your pick of different counters to record. Adding a :u, like instructions:u, to a perf event will count only user-space events, excluding any that ran during interrupt handlers. I usually don't do that, so I can see that overhead as part of the explanation for wall-clock time. But if you do, cycles:u can match very closely with instructions:u.

    -r4 runs it 4 times and averages, which can be useful to see if there's a lot of run-to-run variation instead of just getting one average from a higher value in ECX.

    Adjust your initial ECX value to make the total time about 0.1 to 1 second, that's usually plenty, especially if your CPU ramps up to max turbo very quickly (e.g. Skylake with hardware P-states and a fairly aggressive energy_performance_preference). Or max non-turbo with turbo disabled.

    But this counts in core clock cycles, not reference cycles, so it still gives the same result regardless of CPU frequency changes. (+- some noise from stopping the clock during the transition.)
