Question
Suppose I wanted to copy the contents of a device register into a variable that would be read by multiple threads. Is there a good general way of doing this? Here are two possible methods:
#include <atomic>
volatile int * const Device_reg_ptr = reinterpret_cast<int *>(0x666);
// This variable is read by multiple threads.
std::atomic<int> device_reg_copy;
// ...
// Method 1
const_cast<volatile std::atomic<int> &>(device_reg_copy)
.store(*Device_reg_ptr, std::memory_order_relaxed);
// Method 2
device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
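For readers who want to experiment on a hosted system, here is a compilable stand-in for the setup above (dereferencing the address 0x666 would fault outside the embedded target, so an ordinary int stands in for the device register; `fake_register` and `poll_once` are illustrative names, not part of the question):

```cpp
#include <atomic>

// Stand-in for the memory-mapped device register. On the real target this
// would be reinterpret_cast<volatile int *>(0x666) as in the snippet above.
int fake_register = 0;
volatile int *const Device_reg_ptr = &fake_register;

// This variable is read by multiple threads.
std::atomic<int> device_reg_copy{0};

// Method 2 from above: relaxed store followed by a release fence.
void poll_once() {
    device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
}
```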
More generally, in the face of possible whole program optimization, how does one correctly control the latency of memory writes in one thread being visible in other threads?
EDIT: In your answer, please consider the following scenario:
- The code is running on a CPU in an embedded system.
- A single application is running on the CPU.
- The application has far fewer threads than the CPU has processor cores.
- Each core has a massive number of registers.
- The application is small enough that whole program optimization is successfully used when building its executable.
How do we make sure that a store in one thread does not remain invisible to other threads indefinitely?
Answer 1:
The C++ standard is rather vague about making atomic stores visible to other threads:
29.3.12 Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
That is as detailed as it gets: there is no definition of 'reasonable', and visibility does not have to be immediate.
A stand-alone fence is not necessary to force a certain memory ordering, since you can specify the ordering on the atomic operations themselves; the real question is what you expect a memory fence to accomplish. Fences are designed to enforce ordering on memory operations (between threads), but they do not guarantee visibility in a timely manner.
You can store a value to an atomic variable with the strongest memory ordering (i.e. seq_cst), but even when another thread executes load() at a later time than the store(), you might still get an old value from the cache, and yet (surprisingly) this does not violate the happens-before relationship.
Using a stronger fence might make a difference with respect to timing and visibility, but there are no guarantees.
If prompt visibility is important, I would consider using a Read-Modify-Write (RMW) operation to load the value. These are atomic operations that read and modify atomically (i.e. in a single call), and they have the additional property that they are guaranteed to operate on the latest value. But since they have to reach a little further than the local cache, these calls also tend to be more expensive to execute.
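A minimal sketch of that idea (the helper name `read_latest` is hypothetical; `fetch_add(0)` is one common way to phrase a read as an RMW):

```cpp
#include <atomic>

std::atomic<int> device_reg_copy{0};

// fetch_add(0) leaves the value unchanged, but because it is an RMW
// operation it is guaranteed to read the latest value in the variable's
// modification order -- unlike a plain load, which may return a stale value.
int read_latest() {
    return device_reg_copy.fetch_add(0, std::memory_order_relaxed);
}
```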
As pointed out by Maxim Egorushkin, whether or not you can use weaker memory orderings than the default (seq_cst) depends on whether other memory operations need to be synchronized (made visible) between threads. That is not clear from your question, but it is generally considered safe to use the default (sequential consistency).
If you are on an unusually weak platform, if performance is problematic, and if you need data synchronization between threads, you could consider using acquire/release semantics:
// thread 1
device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);
// thread 2
device_reg_copy.fetch_add(0, std::memory_order_acquire);
If thread 2 sees the value written by thread 1, it is guaranteed that memory operations prior to the store in thread 1 are visible after the load in thread 2. Acquire/Release operations form a pair and they synchronize based on a run-time relationship between the store and load. In other words, if thread 2 does not see the value stored by thread 1, there are no ordering guarantees.
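The run-time pairing described above can be sketched as a small publish test (all names here are illustrative; the payload is deliberately a plain, non-atomic int to show what the release/acquire pair protects):

```cpp
#include <atomic>
#include <thread>

int payload = 0;            // plain data protected by the pairing below
std::atomic<int> ready{0};  // the atomic that forms the release/acquire pair

// The writer publishes 'payload', then signals via a release store; the
// reader spins with acquire loads and, once it observes the signal, is
// guaranteed to also see the earlier write to 'payload'.
int run_publish_test() {
    payload = 0;
    ready.store(0, std::memory_order_relaxed);
    std::thread writer([] {
        payload = 123;                              // happens-before...
        ready.store(1, std::memory_order_release);  // ...this store
    });
    while (ready.load(std::memory_order_acquire) != 1) { /* spin */ }
    int observed = payload;  // guaranteed to be 123 at this point
    writer.join();
    return observed;
}
```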
If the atomic variable has no dependencies on any other data, you can use std::memory_order_relaxed; store ordering is always guaranteed for a single atomic variable.
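A small sketch of that guarantee: relaxed operations on a single atomic still form one total modification order, so a relaxed counter never loses increments (the names below are illustrative):

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> counter{0};

// Each thread increments with memory_order_relaxed. No ordering with other
// data is implied, but every increment to this one variable is atomic, so
// the final count is exact.
int count_relaxed(int n_threads, int per_thread) {
    counter.store(0, std::memory_order_relaxed);
    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i)
        workers.emplace_back([per_thread] {
            for (int j = 0; j < per_thread; ++j)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto &w : workers) w.join();
    return counter.load(std::memory_order_relaxed);
}
```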
As mentioned by others, there is no need for volatile when it comes to inter-thread communication with std::atomic.
Answer 2:
If you would like to update the value of device_reg_copy atomically, then device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed); suffices.
There is no need to apply volatile to atomic variables.
A std::memory_order_relaxed store is supposed to incur the least amount of synchronization overhead; on x86 it is just a plain mov instruction.
However, if you would like to update it in such a way that the effects of any preceding stores become visible to other threads along with the new value of device_reg_copy, then use a std::memory_order_release store, i.e. device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);. The readers need to load device_reg_copy with std::memory_order_acquire in this case. Again, on x86 a std::memory_order_release store is a plain mov.
Whereas if you use the most expensive std::memory_order_seq_cst store, it does insert a memory barrier for you on x86.
This is why they say that the x86 memory model is a bit too strong for C++11: a plain mov instruction already provides std::memory_order_release semantics on stores and std::memory_order_acquire semantics on loads. There is no weaker, truly relaxed store or load on x86.
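To illustrate the point above (the comments describe typical x86-64 code generation by mainstream compilers; the exact instructions vary by compiler and version):

```cpp
#include <atomic>

std::atomic<int> reg_copy{0};

// Typical x86-64 codegen for both: a plain mov -- the hardware already
// provides release semantics for ordinary stores.
void store_relaxed(int v) { reg_copy.store(v, std::memory_order_relaxed); }
void store_release(int v) { reg_copy.store(v, std::memory_order_release); }

// Typical x86-64 codegen: xchg (or mov + mfence), a full barrier -- this is
// the extra cost of sequential consistency on stores.
void store_seq_cst(int v) { reg_copy.store(v, std::memory_order_seq_cst); }
```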
I cannot recommend the CPU Cache Flushing Fallacy article enough.
Source: https://stackoverflow.com/questions/42003798/how-do-i-make-memory-stores-in-one-thread-promptly-visible-in-other-threads