I have already seen this answer and this answer, but neither appears to clear and explicit about the equivalence or non-equivalence of mfence
and xchg
Assuming you're not writing a device-driver (so all the memory is Write-Back, not weakly-ordered Write-Combining), then yes xchg
is as strong as mfence
.
NT stores are fine.
I'm sure that this is the case on current hardware, and fairly sure that this is guaranteed by the wording in the manuals for all future x86 CPUs. xchg
is a very strong full memory barrier.
Hmm, I haven't looked at prefetch instruction reordering. That might possibly be relevant for performance, or possibly even correctness in weird device-driver situations (where you're using cacheable memory when you probably shouldn't be).
From your quote:
(P4/Xeon) Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
That's the one thing that makes xchg [mem]
weaker then mfence
(on Pentium4? Probably also on Sandybridge-family).
mfence
does guarantee that, which is why Skylake had to strengthen it to fix an erratum. (Are loads and stores the only instructions that gets reordered?, and also the answer you linked on Does lock xchg have the same behavior as mfence?)
NT stores are serialized by xchg
/ lock
, it's only weakly-ordered loads that may not be serialized. You can't do weakly-ordered loads from WB memory. movntdqa xmm, [mem]
on WB memory is still strongly-ordered (and on current implementations, also ignores the NT hint instead of doing anything to reduce cache pollution).
It looks like xchg
performs better for seq-cst stores than mov
+mfence
on current CPUs, so you should use that in normal code. (You can't accidentally map WC memory; normal OSes will always give you WB memory for normal allocations. WC is only used for video RAM or other device memory.)
These guarantees are specified in terms of specific families of Intel microarchitectures. It would be nice if there was some common "baseline x86" guarantees that we could assume for future Intel and AMD CPUs.
I assume but haven't checked that the xchg
vs. mfence
situation is the same on AMD. I'm sure there's no correctness problem with using xchg
as a seq-cst store, because that's what compilers other than gcc actually do.