The C++11 standard defines a memory model (1.7, 1.10) which contains memory orderings, which are, roughly, "sequentially-consistent", "acquire", "consume", "rele
Load-consume is much like load-acquire, except that it induces happens-before relationships only to expression evaluations that are data-dependent on the load-consume. Wrapping an expression with kill_dependency results in a value that no longer carries a dependency from the load-consume.
The key use case is for the writer to construct a data structure sequentially, then swing a shared pointer to the new structure (using a release or acq_rel atomic operation). The reader uses load-consume to read the pointer, and dereferences it to get to the data structure. The dereference creates a data dependency, so the reader is guaranteed to see the initialized data.
#include <atomic>
#include <cassert>

std::atomic<int *> foo {nullptr};
std::atomic<int> bar {0};

void thread1()
{
    bar = 7;
    int * x = new int {51};
    foo.store(x, std::memory_order_release);
}

void thread2()
{
    int *y = foo.load(std::memory_order_consume);
    if (y)
    {
        assert(*y == 51); // succeeds: the dereference carries a dependency from the load
        // assert(bar == 7); // may fail: the read of bar is not ordered after the store to bar
        // assert(std::kill_dependency(*y) + bar == 58); // may fail (same reason)
        // assert(*y + bar == 58); // may fail too: the read of bar itself does not depend on y
    }
}
There are two reasons for providing load-consume. The primary reason is that ARM and Power loads are guaranteed to consume, but require additional fencing to turn them into acquires. (On x86, all loads are acquires, so consume provides no direct performance advantage under naive compilation.) The secondary reason is that the compiler can move later operations without data dependence up to before the consume, which it can't do for an acquire. (Enabling such optimizations is the big reason for building all of this memory ordering into the language.)
Wrapping a value with kill_dependency allows computation of an expression that depends on the value to be moved to before the load-consume. This is useful e.g. when the value is an index into an array that was previously read.
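The array-index case could look roughly like the sketch below. The names Node, table, head, publish and read_slot are made up for illustration, and the sketch assumes table is fully initialized before any thread runs and is never modified afterwards.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical names for illustration: Node, table, head, publish, read_slot.
struct Node {
    int index;  // index into `table`
    int value;  // payload written before the node is published
};

// Assumption: initialized before any thread starts and never modified later.
const int table[4] = {10, 20, 30, 40};

std::atomic<Node*> head{nullptr};

void publish(Node* n) {
    // Release store: pairs with the load-consume in read_slot().
    head.store(n, std::memory_order_release);
}

int read_slot() {
    Node* n = head.load(std::memory_order_consume);
    if (!n) return -1;
    // n->index carries a dependency from the load. kill_dependency ends the
    // chain, so the table lookup need not be kept ordered after the load.
    int i = std::kill_dependency(n->index);
    return table[i] + n->value;  // n->value is still a dependent read
}
```

Dropping the dependency is only safe here because the table contents do not change after start-up; the read of n->index itself is still ordered by the consume.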
Note that the use of consume results in a happens-before relation that is no longer transitive (though it is still guaranteed to be acyclic). For example, the store to bar happens before the store to foo, which happens before the dereference of y, which happens before the read of bar (in the commented-out assert), but the store to bar doesn't happen before the read of bar. This leads to a rather more complicated definition of happens-before, but you can imagine how it works (start with sequenced-before, then propagate through any number of release-consume-dataDependency or release-acquire-sequencedBefore links).
Jeff Preshing has a great blog post answering this question. I can't add anything myself, but think anyone wondering about consume vs. acquire should read his post:
http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/
He shows a specific C++ example with corresponding benchmarked assembly code across three different architectures. Compared to memory_order_acquire, memory_order_consume potentially offers a 3x speedup on PowerPC, 1.6x speedup on ARM, and negligible speedup on x86 which has strong consistency anyway. The catch is that as of when he wrote it, only GCC actually treated consume semantics any differently from acquire, and probably because of a bug. Nonetheless, it demonstrates that a speedup is available if the compiler writers can figure out how to take advantage of it.
I'd like to record a partial finding, even though it's not a real answer and doesn't mean that there won't be a big bounty for a proper answer.
After staring at 1.10 for a while, and in particular the very helpful note in paragraph 11, I think this isn't actually so hard. The big difference between synchronizes-with (henceforth: s/w) and dependency-ordered-before (dob) is that a happens-before relationship can be established by concatenating sequenced-before (s/b) and s/w arbitrarily, but not so for dob. Note one of the definitions for inter-thread happens before:
A synchronizes-with X and X is sequenced before B
But the analogous statement for "A is dependency-ordered before X" is missing!
So with release/acquire (i.e. s/w) we can order arbitrary events:
A1  s/b  B1                  Thread 1
          s/w
          C1  s/b  D1        Thread 2
But now consider an arbitrary sequence of events like this:
A2  s/b  B2                  Thread 1
          dob
          C2  s/b  D2        Thread 2
In this sequence, it is still true that A2 happens-before C2 (because A2 is s/b B2 and B2 inter-thread happens before C2 on account of dob; but we could argue that you can never actually tell!). However, it is not true that A2 happens-before D2. The events A2 and D2 are not ordered with respect to one another, unless it actually holds that C2 carries a dependency to D2. This is a stricter requirement, and absent that requirement, A2-to-D2 cannot be ordered "across" the release/consume pair.
In other words, a release/consume pair only propagates an ordering of actions which carry a dependency from one to the next. Everything that's not dependent is not ordered across the release/consume pair.
Furthermore, note that the ordering is restored if we append a final, stronger release/acquire pair:
A2  s/b  B2                            Th 1
          dob
          C2  s/b  D2                  Th 2
                    s/w
                    E2  s/b  F2        Th 3
Now, by the quoted rule, D2 inter-thread happens before F2, and therefore so do C2 and B2, and so A2 happens-before F2. But note that there is still no ordering between A2 and D2; the ordering is only between A2 and later events.
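To make the three-thread chain concrete, here is a rough sketch that maps each labelled event onto an operation. All names (run_chain, a, payload, ptr, flag) are invented for illustration and are not from the standard or from the text above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical sketch: run_chain, a, payload, ptr and flag are made-up names.
int run_chain() {
    int a = 0;                          // written by A2, read by F2
    int payload = 0;                    // only its address is published
    std::atomic<int*> ptr{nullptr};     // B2 stores, C2 loads
    std::atomic<bool> flag{false};      // D2 stores, E2 loads
    int seen = -1;

    std::thread t1([&] {
        a = 1;                                            // A2
        ptr.store(&payload, std::memory_order_release);   // B2
    });
    std::thread t2([&] {
        while (!ptr.load(std::memory_order_consume)) {}   // C2
        // Reading `a` here would be a data race: nothing carries a
        // dependency from the consumed pointer to `a`.
        flag.store(true, std::memory_order_release);      // D2
    });
    std::thread t3([&] {
        while (!flag.load(std::memory_order_acquire)) {}  // E2
        seen = a;                                         // F2, ordered after A2
    });

    t1.join();
    t2.join();
    t3.join();
    return seen;
}
```

Calling run_chain() from main should return 1: thread 2 must not touch a, but thread 3 may, because the release/acquire pair on flag completes the chain from A2 to F2.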
In summary and in closing, dependency carrying is a strict subset of general sequencing, and release/consume pairs provide an ordering only among actions that carry dependency. As long as no stronger ordering is required (e.g. by passing through a release/acquire pair), there is theoretically a potential for additional optimization, since everything that is not in the dependency chain may be reordered freely.
Maybe here is an example that makes sense?
#include <atomic>
#include <cassert>

std::atomic<int> foo(0);
int x = 0;

void thread1()
{
    x = 51;
    foo.store(10, std::memory_order_release);
}

void thread2()
{
    if (foo.load(std::memory_order_acquire) == 10)
    {
        assert(x == 51);
    }
}
As written, the code is race-free and the assertion will hold, because the release/acquire pair orders the store x = 51 before the load in the assertion. However, by changing "acquire" into "consume", this would no longer be true and the program would have a data race on x, since the load of foo carries no dependency into the read of x. The optimization point is that this read can be reordered freely without concern for what foo is doing, because there is no dependency.
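For contrast, here is a sketch of how the same hand-off could stay race-free with consume: publish the address of x, so that the reader reaches x through the loaded value. The names foo_ptr and run_consume_variant are hypothetical and only serve this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical variant: foo_ptr and run_consume_variant are made-up names.
int run_consume_variant() {
    int x = 0;
    std::atomic<int*> foo_ptr{nullptr};
    int seen = 0;

    std::thread writer([&] {
        x = 51;
        foo_ptr.store(&x, std::memory_order_release);
    });
    std::thread reader([&] {
        int* p;
        // Spin until the pointer has been published.
        while (!(p = foo_ptr.load(std::memory_order_consume))) {}
        seen = *p;  // the dereference carries a dependency from the load
    });

    writer.join();
    reader.join();
    return seen;
}
```

The difference from the integer version is that the read of x is now an operand chain starting at the consumed pointer, so the store x = 51 is ordered before it.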
Data dependency ordering was introduced by N2492 with the following rationale:
There are two significant use cases where the current working draft (N2461) does not support scalability near that possible on some existing hardware.
- read access to rarely written concurrent data structures
Rarely written concurrent data structures are quite common, both in operating-system kernels and in server-style applications. Examples include data structures representing outside state (such as routing tables), software configuration (modules currently loaded), hardware configuration (storage device currently in use), and security policies (access control permissions, firewall rules). Read-to-write ratios well in excess of a billion to one are quite common.
- publish-subscribe semantics for pointer-mediated publication
Much communication between threads is pointer-mediated, in which the producer publishes a pointer through which the consumer can access information. Access to that data is possible without full acquire semantics.
In such cases, use of inter-thread data-dependency ordering has resulted in order-of-magnitude speedups and similar improvements in scalability on machines that support inter-thread data-dependency ordering. Such speedups are possible because such machines can avoid the expensive lock acquisitions, atomic instructions, or memory fences that are otherwise required.
emphasis mine
The motivating use case presented there is rcu_dereference() from the Linux kernel.
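A very rough C++ analogue of that read-mostly pattern might look like the sketch below. It is not the kernel's implementation: Config, current, update_config and read_port are hypothetical names, a single writer is assumed, and reclamation of old objects (the hard part of real RCU) is left out entirely.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Hypothetical names for illustration: Config, current, update_config, read_port.
struct Config {
    std::string host;
    int port;
};

std::atomic<const Config*> current{new Config{"localhost", 80}};

// Rare writer (assumed to be a single thread): build a complete copy, then
// publish it with a release store. The old object is intentionally leaked
// here; real RCU defers freeing it until no reader can still be using it.
void update_config(const std::string& host, int port) {
    const Config* next = new Config{host, port};
    current.store(next, std::memory_order_release);
}

// Frequent reader: no lock and no read-modify-write, just one load-consume.
int read_port() {
    const Config* c = current.load(std::memory_order_consume);
    return c->port;  // dependent read, ordered after the writer's initialization
}
```

Readers pay for one atomic load plus an ordinary dereference, which is the property the rationale is after for data that is read far more often than it is written.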