I have read many articles on memory ordering, and all of them only say that a CPU reorders loads and stores.
Does a CPU (I\'m specifically interested in an x86 CPU)
Out-of-order execution preserves the illusion of running in program order for a single thread/core. This is like the C/C++ as-if optimization rule: do whatever you want internally as long as visible effects are the same.
Separate threads can only communicate with each other via memory, so the global order of memory operations (loads/stores) is the only externally visible side-effect of execution1.
Even in-order CPUs can have their memory operations become globally visible out of order. (e.g. even a simple RISC pipeline with a store buffer will have StoreLoad reordering, like x86). A CPU that starts loads / stores in-order but allows them to complete out of order (to hide cache-miss latency) could also reorder loads if it doesn't specifically avoid it (or like modern x86, execute aggressively out-of-order but pretend that it doesn't by tracking memory ordering carefully).
A simple example: two ALU dependency chains can overlap
(related: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for more about how big the window is for finding instruction-level parallelism, e.g. if you increased this to times 200
you'd see only limited overlap. Also related: this beginner to intermediate-level answer I wrote about how an OoO CPU like Haswell or Skylake finds and exploits ILP.)
For a much deeper analysis of the impact of lfence
here, see Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
global _start
_start:
mov ecx, 10000000
.loop:
times 25 imul eax,eax ; expands to imul eax,eax / imul eax,eax / ...
; lfence
times 25 imul edx,edx
; lfence
dec ecx
jnz .loop
xor edi,edi
mov eax,231
syscall ; sys_exit_group(0)
built (with nasm
+ ld
) into a static executable on x86-64 Linux, this runs (on Skylake) in the expected 750M clock cycles for each chain of 25 * 10M
imul instructions times 3 cycle latency.
Commenting out one of the imul
chains doesn't change the time it takes to run: still 750M cycles.
This is definite proof of out-of-order execution interleaving the two dependency chains, otherwise . (imul
throughput is 1 per clock, latency 3 clocks. http://agner.org/optimize/. So a third dependency chain could be mixed in without much slowdown).
Actual numbers from taskset -c 3 ocperf.py stat --no-big-num -etask-clock,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,uops_retired.retire_slots:u -r3 ./imul
:
750566384 +- 0.1%
750704275 +- 0.0%
times 50 imul eax,eax
chain: 1501010762 +- 0.0%
(almost exactly twice as slow, as expected).lfence
preventing overlap between each block of 25 imul
: 1688869394 +- 0.0%
, worse than twice as slow. uops_issued_any
and uops_retired_retire_slots
are both 63M, up from 51M, while uops_executed_thread
is still 51M (lfence
doesn't use any execution ports, but apparently two lfence
instructions cost 6 fused-domain uops each. Agner Fog only measured 2.)(lfence serializes instruction execution, but not memory stores). If you aren't using NT loads from WC memory (which won't happen by accident), it's a no-op other than stopping later instructions from executing until previous instructions have "completed locally". i.e. until they've retired from the out-of-order core. This is probably why it more than doubles the total time: it has to wait for the last imul
in a block to go through more pipeline stages.)
lfence
on Intel is always like that, but on AMD it's only partially-serializing with Spectre mitigation enabled.
Footnote 1: There are also timing side-channels when two logical threads share one physical thread (hyperthreading or other SMT). e.g. executing a sequence of independent imul
instructions will run at 1 per clock on a recent Intel CPU, if the other hyperthread doesn't need port 1 for anything. So you can measure how much port 0 pressure there is by timing an ALU-bound loop on once logical core.
Other micro-architectural side-channels, such as cache accesses, are more reliable. For example, Spectre / Meltdown are easiest to exploit with a cache-read side-channel, rather than ALU.
But all of these side-channels are finicky and unreliable compared to architecturally-supported reads/writes to shared memory, so they're only relevant for security. They're not used intentionally within the same program for communicating between threads.
mfence
on Skylake unexpectedly blocks out-of-order execution of imul
, like lfence
, even though it's not documented to have that effect. (See the moved-to-chat discussion for more).
xchg [rdi], ebx
(implicit lock
prefix) doesn't block out-of-order execution of ALU instructions at all. The total time is still 750M cycles when replacing lfence
with xchg
or a lock
ed instruction in the above test.
But with mfence
, the cost goes up to 1500M cycles + the time for 2 mfence
instructions. To do a controlled experiment, I kept the instruction-count the same but moved the mfence
instructions next to each other, so the imul
chains could reorder with each other, and the time went down to 750M + the time for 2 mfence
instructions.
This Skylake behaviour is very likely the result of a microcode update to fix erratum SKL079, MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. The existence of the erratum shows that it used to be possible to execute later instructions before mfence
completed, so probably they did a brute-force fix of adding lfence
uops to the microcode for mfence
.
This is another factor in favour of using xchg
for seq-cst stores, or even lock add
to some stack memory as a stand-alone barrier. Linux already does both of those things, but compilers still use mfence
for barriers. See Why does a std::atomic store with sequential consistency use XCHG?
(See also discussion about Linux's barrier choices on this Google Groups thread, with links to 3 separate recommendations for using lock addl $0, -4(%esp/rsp)
instead of mfence
as a stand-alone barrier.