Stores are release operations and loads are acquire operations for both. I know that memory_order_seq_cst is meant to impose an additional total ordering on all operations so tagged.
On ISAs like x86 where atomics map to barriers, and where the actual machine model includes a store buffer:

seq_cst stores require flushing the store buffer, so this thread's later reads are delayed until after the store is globally visible. acq_rel does not flush the store buffer. Normal x86 loads and stores already have essentially acq and rel semantics (seq_cst plus a store buffer with store forwarding).

But x86 atomic RMW operations always get promoted to seq_cst, because the x86 asm lock prefix is a full memory barrier. Other ISAs can do relaxed or acq_rel RMWs in asm.
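For instance (a minimal sketch; exact codegen varies by compiler, but on x86 any atomic RMW uses a lock'd instruction regardless of the requested order):

#include <atomic>

std::atomic<int> counter{0};

void bump_relaxed() {
    // Only relaxed ordering is requested, but x86 compilers still emit
    // `lock add` (or `lock xadd`), and the lock prefix is a full barrier,
    // so on x86 this RMW is effectively seq_cst anyway.
    counter.fetch_add(1, std::memory_order_relaxed);
}

On a weakly-ordered ISA like AArch64, the same source can compile to genuinely weaker (cheaper) instructions.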
https://preshing.com/20120515/memory-reordering-caught-in-the-act is an instructive example of the difference between a seq_cst store and a plain release store. (It's actually mov + mfence vs. plain mov in x86 asm. In practice xchg is a more efficient way to do a seq_cst store on most x86 CPUs, but GCC does use mov + mfence.)
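In source terms, the difference is just the ordering argument (a sketch; the commented asm is typical codegen and varies by compiler):

#include <atomic>

std::atomic<int> x{0};

void rel_store() { x.store(1, std::memory_order_release); } // x86: plain mov
void sc_store()  { x.store(1, std::memory_order_seq_cst); } // x86: xchg (or mov + mfence with GCC)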
Fun fact: AArch64's STLR release-store instruction is actually a sequential-release. In hardware, AArch64 only has loads/stores that are relaxed or seq_cst (LDAR / STLR), plus full-barrier instructions.
In theory, STLR only requires draining the store buffer before the next LDAR, i.e. before the next seq_cst load, not before other operations. I don't know whether real AArch64 hardware implements it this way or whether it just drains the store buffer before committing an STLR. (In any case, all earlier stores have to commit before the STLR, but not necessarily before later plain loads.)
So strengthening rel or acq_rel to seq_cst by using LDAR / STLR doesn't need to be expensive.
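A sketch of what that looks like (typical AArch64 codegen, hedged; compilers differ, and cores with FEAT_LRCPC can use ldapr for acquire loads):

#include <atomic>

std::atomic<int> x{0};

void rel_store() { x.store(1, std::memory_order_release); } // AArch64: stlr
void sc_store()  { x.store(1, std::memory_order_seq_cst); } // AArch64: also stlr, the same instruction
int  sc_load()   { return x.load(std::memory_order_seq_cst); } // AArch64: ldar

The seq_cst guarantee comes from the STLR/LDAR interaction, not from an extra barrier instruction.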
Some other ISAs (like PowerPC) have more choices of barriers and can strengthen up to mo_rel or mo_acq_rel more cheaply than mo_seq_cst, but their seq_cst can't be as cheap as AArch64's: seq_cst stores need a full barrier.
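Roughly, assuming the usual C/C++-to-PowerPC mapping (hedged; exact codegen varies by compiler):

#include <atomic>

std::atomic<int> x{0};

void rel_store() { x.store(1, std::memory_order_release); } // PowerPC: lwsync; st (lightweight barrier)
void sc_store()  { x.store(1, std::memory_order_seq_cst); } // PowerPC: hwsync; st (full barrier)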
Try to build Dekker's or Peterson's algorithm with just acquire/release semantics. That won't work, because acquire/release semantics don't provide a [StoreLoad] fence.

In the case of Dekker's algorithm:
flag[self].store(1, std::memory_order_release);              // <-- STORE
while (true) {
    if (flag[other].load(std::memory_order_acquire) == 0) {  // <-- LOAD
        break;
    }
    flag[self].store(0, std::memory_order_release);
    while (turn.load(std::memory_order_acquire) == other) {} // wait for our turn
    flag[self].store(1, std::memory_order_release);
}
Without a [StoreLoad] fence, the load can be reordered ahead of the store (equivalently, the store can stay buffered past the load), and then the algorithm breaks: both threads can simultaneously see that the other's flag is clear, set their own flag, and continue. Now you have two threads inside the critical section.
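One way to repair the entry protocol is to forbid that reordering with a seq_cst fence between the store and the load (a sketch reusing flag and turn from above, assumed to be std::atomic<int>; tagging both the store and the load memory_order_seq_cst works too):

flag[self].store(1, std::memory_order_release);
std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier: blocks [StoreLoad] reordering
while (true) {
    if (flag[other].load(std::memory_order_acquire) == 0) {
        break; // safe: either the other thread sees our flag, or we see theirs
    }
    // back off via turn as above
}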
Take the definition and example from memory_order, but replace memory_order_seq_cst with memory_order_release in the stores and memory_order_acquire in the loads.
Release-acquire ordering guarantees that everything that happened before a store in one thread becomes a visible side effect in the thread that performed the corresponding load. But in our example, nothing happens before the stores in either thread0 or thread1.
x.store(true, std::memory_order_release); // thread0
y.store(true, std::memory_order_release); // thread1
Furthermore, without memory_order_seq_cst, the ordering of thread2's and thread3's loads is not guaranteed. You can imagine they become:
if (y.load(std::memory_order_acquire)) { ++z; } // thread2, load y first
while (!x.load(std::memory_order_acquire)); // and then, load x
if (x.load(std::memory_order_acquire)) { ++z; } // thread3, load x first
while (!y.load(std::memory_order_acquire)); // and then, load y
So if thread2's and thread3's hoisted loads execute before the stores in thread0 and thread1, both x and y read as false, ++z is never executed, z stays 0, and the assert fires.
However, once memory_order_seq_cst enters the picture, it establishes a single total order over all atomic operations so tagged. Thus, in thread2 the x.load is guaranteed to happen before the y.load, and in thread3 the y.load before the x.load.
http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially, memory_order_acq_rel provides read and write orderings relative to one atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, sequentially consistent operations are visible in the same order across all threads.
The example boils down to this:
bool x = false;
bool y = false;
int z = 0;

a() { x = true; }
b() { y = true; }
c() { while (!x); if (y) z++; }
d() { while (!y); if (x) z++; }

// kick off a, b, c, d, join all threads
assert(z != 0);
Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.
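For completeness, a compilable sketch of the full example (following the cppreference version; with seq_cst the assert can never fire, while swapping the orderings for acquire/release makes the failing interleaving legal):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x{false}, y{false};
std::atomic<int>  z{0};

void a() { x.store(true, std::memory_order_seq_cst); }
void b() { y.store(true, std::memory_order_seq_cst); }
void c() { while (!x.load(std::memory_order_seq_cst)) {} if (y.load(std::memory_order_seq_cst)) ++z; }
void d() { while (!y.load(std::memory_order_seq_cst)) {} if (x.load(std::memory_order_seq_cst)) ++z; }

int main() {
    std::thread t1(a), t2(b), t3(c), t4(d);
    t1.join(); t2.join(); t3.join(); t4.join();
    assert(z.load() != 0); // guaranteed with seq_cst; may fail with acquire/release
}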