Micro fusion and addressing modes

前端未结

关注

 4  1975

后悔当初 2020-11-21 06:07

I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA).

The following instruction using [base+index] addressing

4条回答

無奈伤痛 (楼主)

2020-11-21 06:31

Note: Since I wrote this answer, Peter tested Haswell and Skylake as well and integrated the results into the accepted answer above (in particular, most of the improvements I attribute to Skylake below seem to have actually appeared in Haswell). You should see that answer for the rundown of behavior across CPUs and this answer (although not wrong) is mostly of historical interest.

My testing indicates that on Skylake at least¹, the processor fully fuses even complex addressing modes, unlike Sandybridge.

That is, the 1-arg and 2-arg versions of the code posted above by Peter run in the same number of cycles, with the same number of uops dispatched and retired.

My results:

Performance counter stats for ./uop-test:

     23.718772      task-clock (msec)         #    0.973 CPUs utilized          
    20,642,233      cycles                    #    0.870 GHz                    
    80,111,957      instructions              #    3.88  insns per cycle        
    60,253,831      uops_executed_thread      # 2540.344 M/sec                  
    80,295,685      uops_issued_any           # 3385.322 M/sec                  
    80,176,940      uops_retired_retire_slots # 3380.316 M/sec                  

   0.024376698 seconds time elapsed

Performance counter stats for ./uop-test x:

     13.532440      task-clock (msec)         #    0.967 CPUs utilized          
    21,592,044      cycles                    #    1.596 GHz                    
    80,073,676      instructions              #    3.71  insns per cycle        
    60,144,749      uops_executed_thread      # 4444.487 M/sec                  
    80,162,360      uops_issued_any           # 5923.718 M/sec                  
    80,104,978      uops_retired_retire_slots # 5919.478 M/sec                  

   0.013997088 seconds time elapsed

Performance counter stats for ./uop-test x x:

     16.672198      task-clock (msec)         #    0.981 CPUs utilized          
    27,056,453      cycles                    #    1.623 GHz                    
    80,083,140      instructions              #    2.96  insns per cycle        
    60,164,049      uops_executed_thread      # 3608.645 M/sec                  
   100,187,390      uops_issued_any           # 6009.249 M/sec                  
   100,118,409      uops_retired_retire_slots # 6005.112 M/sec                  

   0.016997874 seconds time elapsed

I didn't find any UOPS_RETIRED_ANY instruction on Skylake, only the "retired slots" guy which is apparently fused-domain.

The final test (uop-test x x) is a variant that Peter suggestions which uses a RIP-relative cmp with immediate, which is known not to microfuse:

.loop_riprel
    cmp dword [rel mydata], 1
    cmp dword [rel mydata], 2
    dec ecx
    nop
    nop
    nop
    nop
    jg .loop_riprel

The results show that the extra 2 uops per cycle are picked up by the uops issued and retired counters (hence the test can differentiate between fusion occurring, and not).

More tests on other architectures are welcome! You can find the code (copied from Peter above) in github.

[1] ... and perhaps some other architectures in-between Skylake and Sandybridge, since Peter only tested SB and I only tested SKL.

0 讨论(0)

查看其它4个回答