I have found out that an x86 CPU has the following memory barrier instructions: mfence, lfence, and sfence.
Does an x86 CPU
You are right, the only three memory barrier instructions on x86 CPUs are:
LFENCE
SFENCE
MFENCE
sfence (SSE1) and mfence / lfence (SSE2) are the only instructions that are named for their memory fence/barrier functionality. Unless you're using NT loads or stores and/or WC memory, only mfence is needed for memory ordering.
(Note that lfence on Intel CPUs is also a barrier for out-of-order execution, so it can serialize rdtsc, and is useful for Spectre mitigation to prevent speculative execution. On AMD, there's an MSR that has to be set, otherwise lfence is basically a nop (4/cycle throughput). That MSR was introduced with Spectre-mitigation microcode updates, and is normally set by updated kernels.)
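As a rough illustration of the rdtsc point, here is a minimal sketch that brackets rdtsc with lfence. It assumes GCC or Clang on x86 with SSE2 (for the `_mm_lfence` and `__rdtsc` intrinsics from `<x86intrin.h>`) and falls back to `std::chrono` elsewhere. The names `fenced_timestamp`, `timed_sum`, and `HAVE_FENCED_TSC` are made up for this example, and on AMD the ordering effect depends on the MSR mentioned above being set.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__SSE2__))
#include <x86intrin.h>  // GCC/Clang header providing __rdtsc and _mm_lfence
#define HAVE_FENCED_TSC 1  // hypothetical macro, only used in this sketch
#endif

// Hypothetical helper: take a timestamp that is not read before
// earlier instructions have completed.
inline std::uint64_t fenced_timestamp() {
#ifdef HAVE_FENCED_TSC
    _mm_lfence();                 // wait for earlier instructions to complete
    std::uint64_t t = __rdtsc();  // read the time-stamp counter
    _mm_lfence();                 // keep later instructions from starting early
    return t;
#else
    // Assumption: on non-x86 targets we just use a monotonic clock.
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical example workload: sums 0..n-1 and reports elapsed ticks.
std::uint64_t timed_sum(std::uint64_t n, std::uint64_t& elapsed) {
    std::uint64_t start = fenced_timestamp();
    volatile std::uint64_t sum = 0;  // volatile so the loop is not optimized away
    for (std::uint64_t i = 0; i < n; ++i) {
        sum = sum + i;
    }
    std::uint64_t end = fenced_timestamp();
    elapsed = end - start;
    return sum;
}
```

Calling `timed_sum(1000, ticks)` returns the sum and stores the tick difference in `ticks`. The tick units are TSC reference cycles on x86 and clock ticks in the fallback, so the two are not comparable.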
locked instructions like lock add [mem], eax are also full memory barriers (see "Does lock xchg have the same behavior as mfence?"), although possibly not as strong as mfence for ordering NT loads from WC memory (see "Do locked instructions provide a barrier between weakly-ordered accesses?"). xchg [mem], reg has an implicit lock prefix, so it's also a barrier.
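What a full barrier buys you is easiest to see with the classic store-buffer litmus test. The following is a minimal sketch in portable C++ rather than asm. The function name `count_both_zero` is made up for this example. With seq_cst stores (which x86 compilers implement with xchg or mov + mfence), the outcome where both threads read 0 is forbidden by the C++ memory model. With plain x86 mov stores and loads it would be allowed, because of StoreLoad reordering.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-buffer litmus test `iterations`
// times and returns how often both threads observed 0.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);  // store + full barrier
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);  // store + full barrier
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();

        // r1 and r2 are safe to read here: join() synchronizes with the threads.
        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

Weakening the stores and loads to `std::memory_order_relaxed` would permit a non-zero count, although whether you actually observe one depends on timing.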
In my testing on Skylake, locked instructions do block reordering of NT stores with regular stores with this code: https://godbolt.org/g/7Q9xgz.
xchg seems to be a good way to do a seq-cst store, especially on Intel hardware like Skylake where mfence also blocks out-of-order execution of pure ALU instructions, like lfence: See the bottom of this answer.
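In C++ terms, a seq-cst store can be written either as a store or as an exchange whose result is ignored. This is a minimal sketch with made-up names (`flag`, `publish_with_store`, `publish_with_exchange`). Which instructions you actually get depends on the compiler and version, so check the generated asm rather than trusting the comments.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> flag{0};  // hypothetical shared variable

// Seq-cst store. Depending on compiler and version, x86 code generation
// is typically either mov + mfence or a single xchg.
void publish_with_store(int v) {
    flag.store(v, std::memory_order_seq_cst);
}

// Seq-cst exchange. On x86 this typically compiles to xchg with a
// memory operand, which carries an implicit lock prefix.
// Returns the previous value, which callers may ignore.
int publish_with_exchange(int v) {
    return flag.exchange(v, std::memory_order_seq_cst);
}
```

Both functions leave `flag` holding `v`. The exchange version additionally hands back the old value.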
AMD also recommends using xchg or other locked instructions instead of mfence. (mfence is documented in the AMD manuals as serializing on AMD, so it will always have the penalty of blocking out-of-order execution.)
For sequential-consistency stores or full barriers on 32-bit targets without SSE, compilers typically use "lock or [esp], 0" or another no-op locked instruction just for the memory-barrier effect. That's what g++7.3 -O3 -m32 -mno-sse does for std::atomic_thread_fence(std::memory_order_seq_cst);.
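For reference, here is a minimal sketch of using that fence from C++. The names `data`, `ready`, `producer`, and `try_consume` are made up for this example. The relaxed accesses alone give no ordering. The two fences are what order the data store before the flag store, and the flag load before the data load.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> data{0};       // hypothetical payload
std::atomic<bool> ready{false}; // hypothetical "payload is valid" flag

void producer(int v) {
    data.store(v, std::memory_order_relaxed);
    // Full barrier: on x86 this is where the compiler emits mfence
    // or a dummy locked instruction.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    ready.store(true, std::memory_order_relaxed);
}

// Returns true and writes the payload to `out` if it has been published.
bool try_consume(int& out) {
    if (!ready.load(std::memory_order_relaxed)) {
        return false;
    }
    std::atomic_thread_fence(std::memory_order_seq_cst);
    out = data.load(std::memory_order_relaxed);
    return true;
}
```

A seq_cst fence is stronger than this pattern strictly needs (release and acquire fences would do), but it matches the fence discussed above.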
But anyway, neither mfence nor locked instructions are architecturally defined as serializing on Intel, regardless of implementation details on some CPUs.
Full serializing instructions like cpuid are also full memory barriers, draining the store buffer as well as flushing the pipeline. "Does lock xchg have the same behavior as mfence?" has relevant quotes from Intel's manual.
On Intel processors, the following are architecturally serializing instructions (From: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-273.html):
Privileged serializing instructions: INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV to control register, MOV (to debug register), WBINVD, and WRMSR.
Exceptions: MOV CR8 isn't serializing. WRMSR to the IA32_TSC_DEADLINE MSR (MSR index 6E0H) and the X2APIC MSRs (MSR indices 802H to 83FH) is not serializing.
Non-privileged serializing instructions: CPUID, IRET (1), and RSM.
On AMD processors, the following are architecturally serializing instructions:
Privileged serializing instructions: INVD, INVLPG, LGDT, LIDT, LLDT, LTR, MOV to control register, MOV (to debug register), WBINVD, WRMSR, and SWAPGS.
Non-privileged serializing instructions: MFENCE, CPUID, IRET, and RSM.
The term "[fully] serializing instruction" on Intel processors means exactly the same thing as on AMD processors, with one difference: on AMD processors, a cache-line flush from CLFLUSH (but not CLFLUSHOPT) is ordered with respect to later instructions only by MFENCE.
in / out (and their string-copy versions ins and outs) are full memory barriers, and also partially serializing (like lfence). The docs say they delay execution of the next instruction until after "the data phase" of the I/O transaction.
Footnotes:
(1) According to BJ137 (Sandy Bridge), HSD152 (Haswell), BDM103 (Broadwell):
Problem: An IRET instruction that results in a task switch by returning from a nested task does not serialize the processor (contrary to the Software Developer’s Manual Vol. 3 section titled "Serializing Instructions").
Implication: Software which depends on the serialization property of IRET during task switching may not behave as expected. Intel has not observed this erratum to impact the operation of any commercially available software.
Workaround: None identified. Software can execute an MFENCE instruction immediately prior to the IRET instruction if serialization is needed.