What is the difference between load/store relaxed atomic and normal variable?

Submitted by 天涯浪子 on 2020-12-25 04:35:10

Question


As I see from a test-case: https://godbolt.org/z/K477q1

The generated assembly for a relaxed atomic load/store is the same as for a normal variable: ldr and str

So, is there any difference between relaxed atomic and normal variable?


Answer 1:


The difference is that a normal load/store is not guaranteed to be tear-free, whereas a relaxed atomic read/write is. The atomic also guarantees that the compiler doesn't rearrange or optimize out memory accesses, similar to what volatile guarantees.

(Pre-C++11, volatile was an essential part of rolling your own atomics. But now it's obsolete for that purpose. It does still work in practice but is never recommended: When to use volatile with multi threading? - essentially never.)

On most platforms it just happens that the architecture provides a tear-free load/store by default (for aligned int and long) so it works out the same in asm if loads and stores don't get optimized away. See Why is integer assignment on a naturally aligned variable atomic on x86? for example. In C++ it's up to you to express how the memory should be accessed in your source code instead of relying on architecture-specific features to make the code work as intended.

If you were hand-writing in asm, your source code would already nail down when values were kept in registers vs. loaded / stored to (shared) memory. In C++, telling the compiler when it can/can't keep values private is part of why std::atomic<T> exists.

If you read one article on this topic, take a look at the Preshing one here: https://preshing.com/20130618/atomic-vs-non-atomic-operations/

Also try this presentation from CppCon 2017: https://www.youtube.com/watch?v=ZQFzMfHIxng


Links for further reading:

  • Read a non-atomic variable, atomically?

  • https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering

  • Causing non-atomics to tear

  • https://lwn.net/Articles/793895/

  • What is the (slight) difference on the relaxing atomic rules? which includes a link to a Herb Sutter "atomic weapons" article which is also linked here: https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/


Also see Peter Cordes' linked article: https://electronics.stackexchange.com/q/387181
And a related one about the Linux kernel: https://lwn.net/Articles/793253/

No tearing is only part of what you get with std::atomic<T> - you also avoid data race undefined behaviour.




Answer 2:


atomic<T> constrains the optimizer to not assume the value is unchanged between accesses in the same thread.

atomic<T> also makes sure the object is sufficiently aligned: e.g. some C++ implementations for 32-bit ISAs have alignof(int64_t) = 4 but alignof(atomic<int64_t>) = 8 to enable lock-free 64-bit operations. (e.g. gcc for 32-bit x86 GNU/Linux). In that case, usually a special instruction is needed that the compiler might not use otherwise, e.g. ARMv8 32-bit ldp load-pair, or x86 SSE2 movq xmm before bouncing to integer regs.


In asm for most ISAs, pure-load and pure-store of naturally-aligned int and long are atomic for free, so atomic<T> with memory_order_relaxed can compile to the same asm as plain variables; atomicity (no tearing) doesn't require any special asm. For example: Why is integer assignment on a naturally aligned variable atomic on x86? Depending on surrounding code, the compiler might not manage to optimize out any accesses to non-atomic objects, in which case code-gen will be the same between plain T and atomic<T> with mo_relaxed.

The reverse is not true: It's not at all safe to write C++ as if you were writing in asm. In C++, multiple threads accessing the same object at the same time is data-race undefined behaviour, unless all the accesses are reads.

Thus C++ compilers are allowed to assume that no other threads are changing a variable in a loop, per the "as-if" optimization rule. If bool done is not atomic, a loop like while(!done) { } will compile into if(!done) infinite_loop;, hoisting the load out of the loop. See Multithreading program stuck in optimized mode but runs normally in -O0 for a detailed example with compiler asm output. (Compiling with optimization disabled is very similar to making every object volatile: memory in sync with the abstract machine between C++ statements for consistent debugging.)


Also obviously RMW operations like += or var.fetch_add(1, mo_seq_cst) are atomic and do have to compile to different asm than non-atomic +=. Can num++ be atomic for 'int num'?


The constraints on the optimizer placed by atomic operations are similar to what volatile does. In practice volatile is a way to roll your own mo_relaxed atomic<T>, but without any easy way to get ordering wrt. other operations. It's de-facto supported on some compilers, like GCC, because it's used by the Linux kernel. However, atomic<T> is guaranteed to work by the ISO C++ standard; When to use volatile with multi threading? - there's almost never a reason to roll your own, just use atomic<T> with mo_relaxed.

Also related: Why don't compilers merge redundant std::atomic writes? / Can and does the compiler optimize out two atomic loads? - compilers currently don't optimize atomics at all, so atomic<T> is currently equivalent to volatile atomic<T>, pending further standards work to provide ways for programmers to control when / what optimization would be ok.




Answer 3:


Very good question actually, and I asked the same question when I started learning concurrency.

I'll answer as simply as possible, even though the full answer is a bit more complicated.

Reading and writing to the same non-atomic variable from different threads (with at least one of them writing) is undefined behavior - one thread is not guaranteed to read the value that the other thread wrote.

Using an atomic variable solves the problem - with atomics, every thread is guaranteed to read a valid, untorn value that some thread actually wrote, even if the memory order is relaxed.

In fact, atomics are always thread-safe, regardless of the memory order! The memory order is not for the atomic itself - it's for the surrounding non-atomic data.

Here is the thing - if you use locks, you don't have to think about these low-level details. Memory orders are used in lock-free code, where we need to synchronize non-atomic data.

Here is the beautiful thing about lock-free algorithms: we use atomic operations that are always thread-safe, but we "piggyback" memory orders onto those operations to synchronize the non-atomic data used in those algorithms.

For example, a lock-free linked list. A lock-free linked-list node usually looks something like this:

struct Node {
    std::atomic<Node*> next_node;  // always safe to access concurrently
    T non_atomic_data;             // needs separate synchronization
};

Now, let's say I push a new node onto the list. next_node is always thread-safe; another thread will always see a valid atomic value. But who guarantees that other threads see the correct value of non_atomic_data?

No-one.

Here is a perfect example of the use of memory orders - we "piggyback" on the atomic stores and loads of next_node by adding memory orders that also synchronize the value of non_atomic_data.

So when we store a new node into the list, we use memory_order_release to "push" the non-atomic data out to memory. When we read the new node by reading next_node, we use memory_order_acquire to "pull" the non-atomic data back in. This way we ensure that both next_node and non_atomic_data are always synchronized across threads.

memory_order_relaxed doesn't synchronize any non-atomic data; it synchronizes only itself - the atomic variable. When it is used, developers can assume that the atomic variable doesn't reference any non-atomic data published by the same thread that wrote it. In other words, that atomic variable isn't, for example, an index into a non-atomic array, a pointer to non-atomic data, or an iterator into a non-thread-safe collection. (It would be fine to use relaxed atomic stores and loads for an index into a constant lookup table, or one that's synchronized separately. You only need acq/rel synchronization if the pointed-to or indexed data was written by the same thread.) This is faster (at least on some architectures) than using stronger memory orders, but can be used in fewer cases.

Great, but even this is not the full answer. I said memory orders are not used for atomics. I was half-lying.

With relaxed memory order, atomics are still thread-safe, but they have a downside - they can be reordered. Look at the following snippet:

a.store(1, std::memory_order_relaxed);
b.store(2, std::memory_order_relaxed);

In reality, the store to b can become visible before the store to a. CPUs do this all the time; it's called out-of-order execution, and it's one of the optimization techniques CPUs use to speed up execution (compilers may also reorder relaxed stores). a and b are still thread-safe, even though the thread-safe stores might become visible in the reverse order.

Now, what happens if there is a meaning for the order? Many lock-free algorithms depend on the order of atomic operations for their correctness.

Memory orders are also used to prevent reordering. This is why memory orders are so complicated: they do two things at the same time.

memory_order_acquire tells the compiler and CPU not to reorder operations that appear after it in program order to before it.

Similarly, memory_order_release tells the compiler and CPU not to reorder operations that appear before it in program order to after it.

memory_order_relaxed tells the compiler and CPU that the atomic operation can be reordered when possible, in the same way that non-atomic operations are reordered whenever possible.



Source: https://stackoverflow.com/questions/63810298/what-is-the-difference-between-load-store-relaxed-atomic-and-normal-variable
