Micro fusion and addressing modes

后悔当初 2020-11-21 06:07

I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA).

The following instruction using [base+index] addressing

4 answers
  •  無奈伤痛
    2020-11-21 06:31

    Note: Since I wrote this answer, Peter tested Haswell and Skylake as well and integrated the results into the accepted answer above (in particular, most of the improvements I attribute to Skylake below seem to have actually appeared in Haswell). You should see that answer for the rundown of behavior across CPUs and this answer (although not wrong) is mostly of historical interest.

    My testing indicates that on Skylake at least [1], the processor fully fuses even complex addressing modes, unlike Sandybridge.

    That is, the 1-arg and 2-arg versions of the code posted above by Peter (the ./uop-test and ./uop-test x runs below) run in the same number of cycles, with the same number of uops dispatched and retired.

    My results:

    Performance counter stats for ./uop-test:

         23.718772      task-clock (msec)         #    0.973 CPUs utilized          
        20,642,233      cycles                    #    0.870 GHz                    
        80,111,957      instructions              #    3.88  insns per cycle        
        60,253,831      uops_executed_thread      # 2540.344 M/sec                  
        80,295,685      uops_issued_any           # 3385.322 M/sec                  
        80,176,940      uops_retired_retire_slots # 3380.316 M/sec                  
    
       0.024376698 seconds time elapsed
    

    Performance counter stats for ./uop-test x:

         13.532440      task-clock (msec)         #    0.967 CPUs utilized          
        21,592,044      cycles                    #    1.596 GHz                    
        80,073,676      instructions              #    3.71  insns per cycle        
        60,144,749      uops_executed_thread      # 4444.487 M/sec                  
        80,162,360      uops_issued_any           # 5923.718 M/sec                  
        80,104,978      uops_retired_retire_slots # 5919.478 M/sec                  
    
       0.013997088 seconds time elapsed
    

    Performance counter stats for ./uop-test x x:

         16.672198      task-clock (msec)         #    0.981 CPUs utilized          
        27,056,453      cycles                    #    1.623 GHz                    
        80,083,140      instructions              #    2.96  insns per cycle        
        60,164,049      uops_executed_thread      # 3608.645 M/sec                  
       100,187,390      uops_issued_any           # 6009.249 M/sec                  
       100,118,409      uops_retired_retire_slots # 6005.112 M/sec                  
    
       0.016997874 seconds time elapsed
    

    I didn't find any UOPS_RETIRED_ANY event on Skylake, only the "retire slots" one (uops_retired_retire_slots), which is apparently fused-domain.

    The final test (uop-test x x) is a variant that Peter suggested, which uses a RIP-relative cmp with immediate, a form known not to micro-fuse:

    .loop_riprel:
        cmp dword [rel mydata], 1
        cmp dword [rel mydata], 2
        dec ecx
        nop
        nop
        nop
        nop
        jg .loop_riprel
    

    The results show that the extra 2 uops per iteration are picked up by the uops issued and retired counters (hence the test can differentiate between fusion occurring and not occurring).
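    To make the per-iteration arithmetic explicit, here is a minimal Python sketch that divides the counters from the three runs above by an estimated iteration count. It assumes every loop variant has 8 instructions per iteration, as the RIP-relative loop shown above does; the other two loops are not reproduced here, so that is an assumption rather than something taken from their source. The names RUNS, INSNS_PER_ITER, uops_per_iteration and summarize are made up for illustration.

```python
# Counter values copied from the three perf stat runs above.
RUNS = {
    "uop-test": {
        "instructions": 80_111_957,
        "uops_issued": 80_295_685,
        "uops_executed": 60_253_831,
    },
    "uop-test x": {
        "instructions": 80_073_676,
        "uops_issued": 80_162_360,
        "uops_executed": 60_144_749,
    },
    "uop-test x x": {
        "instructions": 80_083_140,
        "uops_issued": 100_187_390,
        "uops_executed": 60_164_049,
    },
}

# Assumption: each loop body is 8 instructions long, matching the
# RIP-relative loop listed above (2 cmp, dec, 4 nop, jg).
INSNS_PER_ITER = 8


def uops_per_iteration(counters):
    """Return (issued, executed) uops per loop iteration for one run."""
    # Startup code is included in the instruction count, so this is an estimate.
    iterations = counters["instructions"] / INSNS_PER_ITER
    issued = counters["uops_issued"] / iterations
    executed = counters["uops_executed"] / iterations
    return issued, executed


def summarize(runs=RUNS):
    """Map each run name to its rounded (issued, executed) per-iteration pair."""
    return {
        name: tuple(round(value, 2) for value in uops_per_iteration(counters))
        for name, counters in runs.items()
    }


if __name__ == "__main__":
    for name, (issued, executed) in summarize().items():
        print(f"{name:13s} issued/iter={issued:.2f}  executed/iter={executed:.2f}")
```

    Under that assumption the first two runs come out at roughly 8 issued uops per iteration and the RIP-relative run at roughly 10, while executed uops stay at roughly 6 per iteration in all three. That matches the reading above: the two extra uops appear only in the issued and retire-slots counters.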

    More tests on other architectures are welcome! You can find the code (copied from Peter above) on GitHub.


    [1] ... and perhaps some other architectures in-between Skylake and Sandybridge, since Peter only tested SB and I only tested SKL.
