I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA).
The following instruction using [base+index]
addressing >
Note: Since I wrote this answer, Peter tested Haswell and Skylake as well and integrated the results into the accepted answer above (in particular, most of the improvements I attribute to Skylake below seem to have actually appeared in Haswell). You should see that answer for the rundown of behavior across CPUs and this answer (although not wrong) is mostly of historical interest.
My testing indicates that on Skylake at least1, the processor fully fuses even complex addressing modes, unlike Sandybridge.
That is, the 1-arg and 2-arg versions of the code posted above by Peter run in the same number of cycles, with the same number of uops dispatched and retired.
My results:
Performance counter stats for ./uop-test
:
23.718772 task-clock (msec) # 0.973 CPUs utilized
20,642,233 cycles # 0.870 GHz
80,111,957 instructions # 3.88 insns per cycle
60,253,831 uops_executed_thread # 2540.344 M/sec
80,295,685 uops_issued_any # 3385.322 M/sec
80,176,940 uops_retired_retire_slots # 3380.316 M/sec
0.024376698 seconds time elapsed
Performance counter stats for ./uop-test x
:
13.532440 task-clock (msec) # 0.967 CPUs utilized
21,592,044 cycles # 1.596 GHz
80,073,676 instructions # 3.71 insns per cycle
60,144,749 uops_executed_thread # 4444.487 M/sec
80,162,360 uops_issued_any # 5923.718 M/sec
80,104,978 uops_retired_retire_slots # 5919.478 M/sec
0.013997088 seconds time elapsed
Performance counter stats for ./uop-test x x
:
16.672198 task-clock (msec) # 0.981 CPUs utilized
27,056,453 cycles # 1.623 GHz
80,083,140 instructions # 2.96 insns per cycle
60,164,049 uops_executed_thread # 3608.645 M/sec
100,187,390 uops_issued_any # 6009.249 M/sec
100,118,409 uops_retired_retire_slots # 6005.112 M/sec
0.016997874 seconds time elapsed
I didn't find any UOPS_RETIRED_ANY instruction on Skylake, only the "retired slots" guy which is apparently fused-domain.
The final test (uop-test x x
) is a variant that Peter suggestions which uses a RIP-relative cmp
with immediate, which is known not to microfuse:
.loop_riprel
cmp dword [rel mydata], 1
cmp dword [rel mydata], 2
dec ecx
nop
nop
nop
nop
jg .loop_riprel
The results show that the extra 2 uops per cycle are picked up by the uops issued and retired counters (hence the test can differentiate between fusion occurring, and not).
More tests on other architectures are welcome! You can find the code (copied from Peter above) in github.
[1] ... and perhaps some other architectures in-between Skylake and Sandybridge, since Peter only tested SB and I only tested SKL.