Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?

Asked by 春和景丽 on 2021-01-13 09:57

The front end of recent Intel CPUs contains one complex decoder and a number of simple decoders. The complex decoder can handle instructions that decode to multiple µops, while the simple decoders can only handle instructions that decode to a single µop. Can all single-µop instructions be decoded by any of the simple decoders?

1 Answer

遥遥无期, answered 2021-01-13 10:09

    No, there are some instructions that can only decode at 1/clock.

    Andreas's comments indicate that xor eax,eax / setnle al seems to have a decode bottleneck of 1/clock. I found the same thing with cdq: it reads EAX and writes EDX, also demonstrably runs faster from the DSB (uop cache), doesn't involve partial registers or anything weird at all, and doesn't need a dep-breaking instruction.
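
    (For reference, a minimal NASM loop of the kind presumably used to measure that xor-zeroing / setnle decode bottleneck might look like this; the repeat count and loop structure are my guesses, not Andreas's exact test:)

    align 64
    .loop:
    %rep 20
        xor  eax, eax      ; dep-breaking zeroing idiom, 1 uop (recognized at issue/rename)
        setnle al          ; 1 uop, but apparently decodes at only 1/clock
    %endrep
        dec  ebp
        jnz  .loop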

    Even better, being a single-byte instruction, it can defeat the DSB with only a short block of instructions. (This leads to misleading results from testing on some CPUs, e.g. in Agner Fog's tables and on https://uops.info/, where SKX is shown as 1c throughput.) https://www.uops.info/html-tp/SKX/CDQ-Measurements.html vs. https://www.uops.info/html-tp/CFL/CDQ-Measurements.html have inconsistent throughputs because of different testing methods: only the Coffee Lake test ever used a small enough unroll count (10) to not bust the DSB, finding a throughput of 0.6. (The actual throughput is 0.5 once you account for loop overhead, fully explained by back-end port pressure, same as cqo. IDK why you'd find 0.6 instead of 0.55 with only one extra uop for p6 in the loop.)

    (Zen can run this instruction at 0.25c throughput, with no weird decode problems; it's handled by every integer-ALU port.)


    times 10 cdq in a dec/jnz loop can run from the uop cache, and runs at 0.5c throughput on Skylake (p06), plus loop overhead which also competes for p6.

    times 20 cdq is more than 3 uop cache lines for one 32-byte block of machine code, meaning the loop can only run from legacy decode (with the top of the loop aligned). On Skylake this runs at 1 cycle per cdq. Perf counters confirm MITE delivers 1 uop per cycle, rather than groups of 3 or 4 with idle cycles between.
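
    A quick capacity check, using the usual Skylake DSB figures (up to 3 uop-cache lines per aligned 32-byte code window, up to 6 uops per line):

    times 20:  20 x cdq (1 uop each) + dec/jnz (1 macro-fused uop) = 21 uops,
               all in one 32B window.  3 lines x 6 uops = 18 < 21, so the
               window can't be cached and the whole loop runs from MITE.
    times 10:  10 x cdq + dec/jnz = 11 uops;  11 <= 18, so it fits in the DSB.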

    default rel
    %ifdef __YASM_VER__
        CPU Skylake AMD
    %else
    %use smartalign
    alignmode p6, 64
    %endif
    
    global _start
    _start:
        mov  ebp, 1000000000
    
    align 64
    .loop:
        ; uncomment one group at a time to test:
        ;times 10 cdq    ; 0.5c throughput (fits in the uop cache)
        ;times 20 cdq    ; 1c throughput: legacy decode, 1 MITE uop per cycle

        ;times 10 cqo    ; 0.5c throughput: 2-byte insn, still fits in the uop cache
        ;times 10 cdqe   ; 1c throughput: serial data dependency (reads EAX, writes RAX)
        ;times 10 cld    ; ~4c throughput, 3 uops
    
        dec ebp
        jnz .loop
    .end:
    
        xor edi,edi
        mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
        syscall       ; sys_exit_group(0)
    

    On my Arch Linux desktop, I built this into a static executable to run under perf:

    • i7-6700k with epp=balance_performance (max "turbo" = 3.9GHz)
    • microcode revision 0xd6 (so LSD disabled, not that it matters: loops can only run from the LSD loop buffer if all their uops are in the DSB uop cache, IIRC.)
    In a bash shell:
    t=cdq-latency; nasm -f elf64 "$t".asm && ld -o "$t" "$t.o" && objdump -drwC -Mintel "$t" && taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,frontend_retired.dsb_miss,idq.dsb_uops,idq.mite_uops,idq.mite_cycles,idq_uops_not_delivered.core,idq_uops_not_delivered.cycles_fe_was_ok,idq.all_mite_cycles_4_uops ./"$t"
    

    Disassembly:

    0000000000401000 <_start>:
      401000:       bd 00 ca 9a 3b          mov    ebp,0x3b9aca00
      401005:       0f 1f 84 00 00 00 00 00         nop    DWORD PTR [rax+rax*1+0x0]
    ...
      40103d:       0f 1f 00                nop    DWORD PTR [rax]
    
    0000000000401040 <_start.loop>:
      401040:       99                      cdq    
      401041:       99                      cdq    
      401042:       99                      cdq    
      401043:       99                      cdq    
    ...
      401052:       99                      cdq    
      401053:       99                      cdq             # 20 total CDQ
      401054:       ff cd                   dec    ebp
      401056:       75 e8                   jne    401040 <_start.loop>
    
    0000000000401058 <_start.end>:
      401058:       31 ff                   xor    edi,edi
      40105a:       b8 e7 00 00 00          mov    eax,0xe7
      40105f:       0f 05                   syscall 
    

    Perf results:

     Performance counter stats for './cdq-latency':
    
              5,205.44 msec task-clock                #    1.000 CPUs utilized          
                     0      context-switches          #    0.000 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                     1      page-faults               #    0.000 K/sec                  
        20,124,711,776      cycles                    #    3.866 GHz                      (49.88%)
        22,015,118,295      instructions              #    1.09  insn per cycle           (59.91%)
        21,004,212,389      uops_issued.any           # 4035.049 M/sec                    (59.97%)
         1,005,872,141      frontend_retired.dsb_miss #  193.235 M/sec                    (60.03%)
                     0      idq.dsb_uops              #    0.000 K/sec                    (60.08%)
        20,997,157,414      idq.mite_uops             # 4033.694 M/sec                    (60.12%)
        19,996,447,738      idq.mite_cycles           # 3841.451 M/sec                    (40.03%)
        59,048,559,790      idq_uops_not_delivered.core # 11343.621 M/sec                   (39.97%)
           112,956,733      idq_uops_not_delivered.cycles_fe_was_ok #   21.700 M/sec                    (39.92%)
               209,490      idq.all_mite_cycles_4_uops #    0.040 M/sec                    (39.88%)
    
           5.206491348 seconds time elapsed
    

    So the loop overhead (dec/jnz) happened basically for free, decoding in the same cycle as the last cdq. Counts are not exact because I used too many events in one run (with HT enabled), so perf did software multiplexing. From another run with fewer counters:
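
    (The exact command for that run isn't shown; presumably it was the same perf stat invocation with a reduced event list, something like:)

    taskset -c 3 perf stat --all-user -etask-clock,cycles,idq.mite_cycles,idq_uops_not_delivered.core,idq_uops_not_delivered.cycles_fe_was_ok ./cdq-latency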

    # same source, only these HW counters enabled to avoid multiplexing
              5,161.14 msec task-clock                #    1.000 CPUs utilized          
    
        20,107,065,550      cycles                    #    3.896 GHz                    
        20,000,134,955      idq.mite_cycles           # 3875.142 M/sec                  
        59,050,860,720      idq_uops_not_delivered.core # 11441.447 M/sec                 
            95,968,317      idq_uops_not_delivered.cycles_fe_was_ok #   18.594 M/sec                  
    

    So we can see that MITE (legacy decode) was active basically every cycle, and that the front-end was basically never "ok", i.e. the bottleneck was essentially never the back-end.
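
    A back-of-the-envelope check on the "1 MITE uop per cycle" claim, assuming the usual 4-wide issue width:

    issue slots  = 4/cycle x 20.107e9 cycles      = 80.43e9
    empty slots  = idq_uops_not_delivered.core    = 59.05e9
    delivered    = 80.43e9 - 59.05e9              = 21.38e9 uops
                 -> ~1.06 uops/cycle: one cdq per cycle from MITE, plus loop overhead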


    With only 10 CDQ instructions, allowing the DSB to work:

    ...
    0000000000401040 <_start.loop>:
      401040:       99                      cdq    
      401041:       99                      cdq    
    ...
      401049:       99                      cdq        # 10 total CDQ insns
      40104a:       ff cd                   dec    ebp
      40104c:       75 f2                   jne    401040 <_start.loop>
    
     Performance counter stats for './cdq-latency' (4 runs):
    
              1,417.38 msec task-clock                #    1.000 CPUs utilized            ( +-  0.03% )
                     0      context-switches          #    0.000 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                     1      page-faults               #    0.001 K/sec                  
         5,511,283,047      cycles                    #    3.888 GHz                      ( +-  0.03% )  (49.83%)
        11,997,247,694      instructions              #    2.18  insn per cycle           ( +-  0.00% )  (59.99%)
        10,999,182,841      uops_issued.any           # 7760.224 M/sec                    ( +-  0.00% )  (60.17%)
               197,753      frontend_retired.dsb_miss #    0.140 M/sec                    ( +- 13.62% )  (60.21%)
        10,988,958,908      idq.dsb_uops              # 7753.010 M/sec                    ( +-  0.03% )  (60.21%)
            10,234,859      idq.mite_uops             #    7.221 M/sec                    ( +- 27.43% )  (60.21%)
             8,114,909      idq.mite_cycles           #    5.725 M/sec                    ( +- 26.11% )  (39.83%)
            40,588,332      idq_uops_not_delivered.core #   28.636 M/sec                    ( +- 21.83% )  (39.79%)
         5,502,581,002      idq_uops_not_delivered.cycles_fe_was_ok # 3882.221 M/sec                    ( +-  0.01% )  (39.79%)
                56,223      idq.all_mite_cycles_4_uops #    0.040 M/sec                    ( +-  3.32% )  (39.79%)
    
              1.417599 +- 0.000489 seconds time elapsed  ( +-  0.03% )
    

    As reported by idq_uops_not_delivered.cycles_fe_was_ok, basically all the unused front-end uop slots were the fault of the back-end (port pressure on p0 / p6), not the front-end.
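
    That back-end limit checks out numerically: cdq is a p06 uop on Skylake and the macro-fused dec/jnz can only use p6, so each iteration has 11 uops competing for two ports:

    per iteration:  10 x cdq (p0 or p6) + dec/jnz (p6 only) = 11 uops on 2 ports
    lower bound:    11 / 2                                  = 5.5 cycles/iteration
    measured:       5.511e9 cycles / 1e9 iterations         ~ 5.51 cycles/iteration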
