As we know from a previous answer to "Does it make any sense instruction LFENCE in processors x86/x86_64?", we cannot use SFENCE instead of MFENCE.
In general MFENCE != SFENCE + LFENCE. For example, the code below, when compiled with -DBROKEN, fails on some Westmere and Sandy Bridge systems but appears to work on Ryzen. In fact, on AMD systems just an SFENCE seems to be sufficient.
#include <atomic>
#include <cstddef>   // std::size_t
#include <cstdint>   // uint64_t
#include <cstdlib>   // atoi
#include <thread>
#include <vector>
#include <iostream>
using namespace std;
#define ITERATIONS (10000000)
class minircu {
public:
    minircu() : rv_(0), wv_(0) {}

    class lock_guard {
        minircu& _r;
        const std::size_t _id;
    public:
        lock_guard(minircu& r, std::size_t id) : _r(r), _id(id) { _r.rlock(_id); }
        ~lock_guard() { _r.runlock(_id); }
    };

    void synchronize() {
        wv_.store(-1, std::memory_order_seq_cst);
        while(rv_.load(std::memory_order_relaxed) & wv_.load(std::memory_order_acquire));
    }

private:
    void rlock(std::size_t id) {
        rab_[id].store(1, std::memory_order_relaxed);
#ifndef BROKEN
        __asm__ __volatile__ ("mfence;" : : : "memory");
#else
        __asm__ __volatile__ ("sfence; lfence;" : : : "memory");
#endif
    }

    void runlock(std::size_t id) {
        rab_[id].store(0, std::memory_order_release);
        wab_[id].store(0, std::memory_order_release);
    }

    union alignas(64) {
        std::atomic<uint64_t> rv_;
        std::atomic<unsigned char> rab_[8];
    };
    union alignas(8) {
        std::atomic<uint64_t> wv_;
        std::atomic<unsigned char> wab_[8];
    };
};

minircu r;
std::atomic<int> shared_values[2];
std::atomic<std::atomic<int>*> pvalue(shared_values);
std::atomic<uint64_t> total(0);

void r_thread(std::size_t id) {
    uint64_t subtotal = 0;
    for(size_t i = 0; i < ITERATIONS; ++i) {
        minircu::lock_guard l(r, id);
        subtotal += (*pvalue).load(memory_order_acquire);
    }
    total += subtotal;
}

void wr_thread() {
    for (size_t i = 1; i < (ITERATIONS/10); ++i) {
        std::atomic<int>* o = pvalue.load(memory_order_relaxed);
        std::atomic<int>* p = shared_values + i % 2;
        p->store(1, memory_order_release);
        pvalue.store(p, memory_order_release);
        r.synchronize();
        o->store(0, memory_order_relaxed); // should not be visible to readers
    }
}

int main(int argc, char* argv[]) {
    std::vector<std::thread> vec_thread;
    shared_values[0] = shared_values[1] = 1;
    std::size_t readers = (argc > 1) ? ::atoi(argv[1]) : 8;
    if (readers > 8) {
        std::cout << "maximum number of readers is " << 8 << std::endl; return 0;
    } else
        std::cout << readers << " readers" << std::endl;
    vec_thread.emplace_back( [=]() { wr_thread(); } );
    for(size_t i = 0; i < readers; ++i)
        vec_thread.emplace_back( [=]() { r_thread(i); } );
    for(auto &i: vec_thread) i.join();
    std::cout << "total = " << total << ", expecting " << readers * ITERATIONS << std::endl;
    return 0;
}
What mechanism does LFENCE disable to make reordering impossible (x86 has no Invalidate-Queue mechanism)?
From the Intel manuals, volume 2A, page 3-464, documentation for the LFENCE instruction:
LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.
So yes, your example reordering is explicitly prevented by the LFENCE instruction. Your second example involving only SFENCE instructions IS a valid reordering, since SFENCE has no impact on load operations.
MFENCE drains the store buffer before later loads (see footnote 1) can execute.
LFENCE drains the ROB before later instructions can issue into the back-end.
SFENCE only orders stores against other stores, i.e. prevents NT stores from committing from the store buffer ahead of SFENCE itself. But otherwise SFENCE is just like a plain store that moves through the store buffer. Think of it like putting a divider on a grocery-store checkout conveyor belt that stops NT stores from getting grabbed early. It does not force the store buffer to be drained before it retires, so putting LFENCE after it doesn't add up to MFENCE.
(AMD SFENCE is stronger, a full barrier IIRC, but the minimum behaviour across Intel/AMD/Via/etc. is what Intel documents.)
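The one job SFENCE does have on Intel, ordering NT stores ahead of a later ordinary store, can be sketched as follows. This is a minimal illustration and not code from the question: `payload`, `ready`, `kWords`, `publish` and `try_consume` are hypothetical names, and the sketch assumes an SSE2-capable x86 target (it falls back to plain stores elsewhere, where the fence is not needed).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <immintrin.h>   // _mm_stream_si32, _mm_sfence
#define HAVE_NT_STORES 1
#else
#define HAVE_NT_STORES 0
#endif

constexpr std::size_t kWords = 1024;   // hypothetical buffer size
alignas(64) int payload[kWords];       // data written with NT stores
std::atomic<bool> ready{false};        // flag written with a normal store

// Producer: fill the buffer with weakly-ordered NT stores, then publish it.
void publish(int seed) {
    assert(!ready.load(std::memory_order_relaxed) && "buffer is single-use");
    for (std::size_t i = 0; i < kWords; ++i) {
#if HAVE_NT_STORES
        _mm_stream_si32(&payload[i], seed + static_cast<int>(i));
#else
        payload[i] = seed + static_cast<int>(i);
#endif
    }
#if HAVE_NT_STORES
    // Without this, the NT stores could become globally visible AFTER the
    // flag store below. SFENCE orders them ahead of all later stores.
    _mm_sfence();
#endif
    ready.store(true, std::memory_order_release);
}

// Consumer: returns false until the flag is set, then sums the payload.
bool try_consume(long long& sum) {
    if (!ready.load(std::memory_order_acquire)) return false;
    sum = 0;
    for (std::size_t i = 0; i < kWords; ++i) sum += payload[i];
    return true;
}
```

In real use `try_consume` would be polled from another thread. Note that nothing here orders the producer's stores against its own later loads; that still takes MFENCE or a locked instruction.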
Footnote 1: OoO exec of later stores doesn't need to be blocked by MFENCE; executing them just writes data into the store buffer. In-order commit already orders them after earlier stores, and commit after retirement orders wrt. loads (because x86 requires loads to complete, not just to start, before they can retire, as part of ensuring load ordering). Remember that x86 hardware is built to disallow reordering other than StoreLoad.
MFENCE does have to prevent NT stores from reordering with other stores, so it has to include whatever SFENCE does, as well as draining the store buffer. It also has to prevent reordering of weakly-ordered SSE4.1 NT loads from WC memory, which is harder because the normal rules that get load ordering for free no longer apply to those. Guaranteeing this is why a Skylake microcode update strengthened (and slowed) MFENCE to also drain the ROB like LFENCE. It might still be possible for MFENCE to be lighter weight than that with HW support for optionally enforcing ordering of NT loads in the pipeline.
SFENCE + LFENCE doesn't block StoreLoad reordering, so it's not sufficient for sequential consistency. Only mfence (or a locked operation, or a real serializing instruction like cpuid) will do that. See Jeff Preshing's Memory Reordering Caught in the Act for a case where only a full barrier is sufficient.
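The StoreLoad case can be observed with a store-buffering litmus test in the style of Preshing's article. The following is a minimal sketch rather than code from that article or from the question: `count_both_zero` and `full_fence` are hypothetical names, and it uses the portable `std::atomic_thread_fence(std::memory_order_seq_cst)`, which compilers typically implement on x86 with `mfence` or a `lock`ed instruction, in place of inline asm.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Store-buffering litmus test: each thread stores 1 to its own variable and
// then loads the other one. Returns how many rounds ended with BOTH loads
// reading 0, an outcome that requires StoreLoad reordering.
int count_both_zero(int rounds, bool full_fence) {
    assert(rounds >= 0);
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        std::atomic<bool> start{false};
        int r1 = -1, r2 = -1;

        auto body = [&](std::atomic<int>& mine, std::atomic<int>& other, int& out) {
            // Wait for the start flag so both threads run at about the same time.
            while (!start.load(std::memory_order_acquire)) std::this_thread::yield();
            mine.store(1, std::memory_order_relaxed);
            if (full_fence) {
                // Full barrier between the store and the load.
                std::atomic_thread_fence(std::memory_order_seq_cst);
            }
            out = other.load(std::memory_order_relaxed);
        };

        std::thread t1([&] { body(x, y, r1); });
        std::thread t2([&] { body(y, x, r2); });
        start.store(true, std::memory_order_release);
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With `full_fence` set, the C++ memory model guarantees a result of 0. Without it, the count may be nonzero on x86, although how often the reordering shows up depends heavily on timing, so a zero result in that mode proves nothing.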
From Intel's instruction-set reference manual entry for sfence:
The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible.
but
It is not ordered with respect to memory loads or the LFENCE instruction.
LFENCE forces earlier instructions to "complete locally" (i.e. retire from the out-of-order part of the core), but for a store or SFENCE that just means putting data or a marker in the memory-order buffer, not flushing it so the store becomes globally visible. i.e. SFENCE "completion" (retirement from the ROB) doesn't include flushing the store buffer.
This is like what Preshing describes in Memory Barriers Are Like Source Control Operations, where StoreStore barriers aren't "instant". Later in that article, he explains why a #StoreStore + #LoadLoad + #LoadStore barrier doesn't add up to a #StoreLoad barrier. (x86 LFENCE has some extra serialization of the instruction stream, but since it doesn't flush the store buffer the reasoning still holds.)
LFENCE is not fully serializing like cpuid (which is as strong a memory barrier as mfence or a locked instruction). It's just a LoadLoad + LoadStore barrier, plus some execution serialization stuff which maybe started as an implementation detail but is now enshrined as a guarantee, at least on Intel CPUs. It's useful with rdtsc, and for avoiding branch speculation to mitigate Spectre.
BTW, SFENCE is a no-op except for NT stores; it orders them with respect to normal (release) stores, but not with respect to loads or LFENCE. Only on a CPU that's normally weakly-ordered does a store-store barrier do anything.
The real concern is StoreLoad reordering between a store and a load, not between a store and barriers, so you should look at a case with a store, then a barrier, then a load.
mov [var1], eax
sfence
lfence
mov eax, [var2]
can become globally visible (i.e. commit to L1d cache) in this order:
lfence
mov eax, [var2] ; load stays after LFENCE
mov [var1], eax ; store becomes globally visible before SFENCE
sfence ; can reorder with LFENCE