C++20 std::atomic- std::atomic.specializations

问题

C++20 includes specializations for atomic<float> and atomic<double>. Can anyone here explain for what practical purpose this should be good for? The only purpose I can imagine is when I have a thread that changes an atomic double or float asynchronously at random points and other threads read this values asynchronously (but a volatile double or float should in fact do the same on most platforms). But the need for this should be extremely rare. I think this rare case couldn't justify an inclusion into the C++20 standard.

回答1:

EDIT: Adding Ulrich Eckhardt's comment to clarify: 'Let me try to rephrase that: Even if volatile on one particular platform/environment/compiler did the same thing as atomic<>, down to the generated machine code, then atomic<> is still much more expressive in its guarantees and furthermore, it is guaranteed to be portable. Moreover, when you can write self-documenting code, then you should do that.'

Volatile sometimes has the below 2 effects:

Prevents compilers from caching the value in a register.
Prevents optimizing away accesses to that value when they seem unnecessary from the POV of your program.

See also Understanding volatile keyword in c++

TLDR;

Be explicit about what you want.

Do not rely on 'volatile' do do what you want, if 'what' is not the original purpose of volatile, e.g. enabling external sensors or DMA to change a memory address without the compiler interfering.
If you want an atomic, use std::atomic.
If you want to disable strict aliasing optimizations, do like the Linux kernel, and disable strict aliasing optimizations on e.g. gcc.
If you want to disable other kinds of compiler optimizations, use compiler intrinsics or code explicit assembly for e.g ARM or x86_64.
If you want 'restrict' keyword semantics like in C, use the corresponding restrict intrinsic in C++ on your compiler, if available.
In short, do not rely on compiler- and CPU-family dependent behavior if constructs provided by the standard are clearer and more portable. Use e.g. godbolt.org to compare the assembler output if you believe your 'hack' is more efficient than doing it the right way.

From std::memory_order

Relationship with volatile

Within a thread of execution, accesses (reads and writes) through volatile glvalues cannot be reordered past observable side-effects (including other volatile accesses) that are sequenced-before or sequenced-after within the same thread, but this order is not guaranteed to be observed by another thread, since volatile access does not establish inter-thread synchronization.

In addition, volatile accesses are not atomic (concurrent read and write is a data race) and do not order memory (non-volatile memory accesses may be freely reordered around the volatile access).

One notable exception is Visual Studio, where, with default settings, every volatile write has release semantics and every volatile read has acquire semantics (MSDN), and thus volatiles may be used for inter-thread synchronization. Standard volatile semantics are not applicable to multithreaded programming, although they are sufficient for e.g. communication with a std::signal handler that runs in the same thread when applied to sig_atomic_t variables.

As a final rant: In practice, the only feasible languages for building an OS kernel are usually C and C++. Given that, I would like provisions in the 2 standards for 'telling the compiler to butt out', i.e. to be able to explicitly tell the compiler to not change the 'intent' of the code. The purpose would be to use C or C++ as a portable assembler, to an even greater degree than today.

An somewhat silly code example is worth compiling on e.g. godbolt.org for ARM and x86_64, both gcc, to see that in the ARM case, the compiler generates two __sync_synchronize (HW CPU barrier) operations for the atomic, but not for the volatile variant of the code (uncomment the one you want). The point being that using atomic gives predictable, portable behavior.

#include <inttypes.h>
#include <atomic>

std::atomic<uint32_t> sensorval;
//volatile uint32_t sensorval;

uint32_t foo()
{
    uint32_t retval = sensorval;
    return retval;
}
int main()
{
    return (int)foo();
}

Godbolt output for ARM gcc 8.3.1:

foo():
  push {r4, lr}
  ldr r4, .L4
  bl __sync_synchronize
  ldr r4, [r4]
  bl __sync_synchronize
  mov r0, r4
  pop {r4, lr}
  bx lr
.L4:
  .word .LANCHOR0

For those who want an X86 example, a colleague of mine, Angus Lepper, graciously contributed this example: godbolt example of bad volatile use on x86_64

回答2:

atomic<float> and atomic<double> have existed since C++11. The atomic<T> template works for arbitrary trivially-copyable T. Everything you could hack up with legacy pre-C++11 use of volatile for shared variables can be done with C++11 atomic<double> with std::memory_order_relaxed.

What doesn't exist until C++20 are atomic RMW operations like x.fetch_add(3.14); or for short x += 3.14. (Why isn't atomic double fully implemented wonders why not). Those member functions were only available in the atomic integer specializations, so you could only load, store, exchange, and CAS on float and double, like for arbitrary T like class types.

See Atomic double floating point or SSE/AVX vector load/store on x86_64 for details on how to roll your own with compare_exchange_weak, and how that (and pure load, pure store, and exchange) compiles in practice with GCC and clang for x86. (Not always optimal, gcc bouncing to integer regs unnecessarily.) Also for details on lack of atomic<__m128i> load/store because vendors won't publish real guarantees to let us take advantage (in a future-proof way) of what current HW does.

These new specializations provide maybe some efficiency (on non-x86) and convenience with fetch_add and fetch_sub (and the equivalent += and -= overloads). Only those 2 operations that are supported, not fetch_mul or anything else. See the current draft of 31.8.3 Specializations for floating-point types, and cppreference std::atomic

It's not like the committee went out of their way to introduce new FP-relevant atomic RMW member functions fetch_mul, min, max, or even absolute value or negation, which is ironically easier in asm, just bitwise AND or XOR to clear or flip the sign bit and can be done with x86 lock and if the old value isn't needed. Actually since carry-out from the MSB doesn't matter, 64-bit lock xadd can implement fetch_xor with 1ULL<<63. Assuming of course IEEE754 style sign/magnitude FP. Similarly easy on LL/SC machines that can do 4-byte or 8-byte fetch_xor, and they can easily keep the old value in a register.

So the one thing that could be done significantly more efficiently in x86 asm than in portable C++ without union hacks (atomic bitwise ops on FP bit patterns) still isn't exposed by ISO C++.

It makes sense that the integer specializations don't have fetch_mul: integer add is much cheaper, typically 1 cycle latency, the same level of complexity as atomic CAS. But for floating point, multiply and add are both quite complex and typically have similar latency. Moreover, if atomic RMW fetch_add is useful for anything, I'd assume fetch_mul would be, too. Again unlike integer where lockless algorithms commonly add/sub but very rarely need to build an atomic shift or mul out of a CAS. x86 doesn't have memory-destination multiply so has no direct HW support for lock imul.

It seems like this is more a matter of bringing atomic<double> up to the level you might naively expect (supporting .fetch_add and sub like integers), not of providing a serious library of atomic RMW FP operations. Perhaps that makes it easier to write templates that don't have to check for integral, just numeric, types?

Can anyone here explain for what practical purpose this should be good for?

For pure store / pure load, maybe some global scale factor that you want to be able to publish to all threads with a simple store? And readers load it before every work unit or something. Or just as part of a lockless queue or stack of double.

It's not a coincidence that it took until C++20 for anyone to say "we should provide fetch_add for atomic<double> in case anyone wants it."

Plausible use-case: to manually multi-thread the sum of an array (instead of using #pragma omp parallel for simd reduction(+:my_sum_variable) or a standard <algorithm> like std::accumulate with a C++17 parallel execution policy).

The parent thread might start with atomic<double> total = 0; and pass it by reference to each thread. Then threads do *totalptr += sum_region(array+TID*size, size) to accumulate the results. Instead of having a separate output variable for each thread and collecting the results in one caller. It's not bad for contention unless all threads finish at nearly the same time. (Which is not unlikely, but it's at least a plausible scenario.)

If you just want separate load and separate store atomicity like you're hoping for from volatile, you already have that with C++11.

Don't use `volatile` for threading: use `atomic<T>` with `mo_relaxed`

See When to use volatile with multi threading? for details on mo_relaxed atomic vs. legacy volatile for multithreading. volatile data races are UB, but it does work in practice as part of roll-your-own atomics on compilers that support it, with inline asm needed if you want any ordering wrt. other operations, or if you want RMW atomicity instead of separate load / ALU / separate store. All mainstream CPUs have coherent cache/shared memory. But with C++11 there's no reason to do that: std::atomic<> obsoleted hand-rolled volatile shared variables.

At least in theory. In practice some compilers (like GCC) still have missed-optimizations for atomic<double> / atomic<float> even for just simple load and store. (And the C++20 new overloads aren't implemented yet on Godbolt). atomic<integer> is fine though, and does optimize as well as volatile or plain integer + memory barriers.

In some ABIs (like 32-bit x86), alignof(double) is only 4. Compilers normally align it by 8 but inside structs they have to follow the ABI's struct packing rules so an under-aligned volatile double is possible. Tearing will be possible in practice if it splits a cache-line boundary, or on some AMD an 8-byte boundary. atomic<double> instead of volatile can plausibly matter for correctness on some real platforms, even when you don't need atomic RMW. e.g. this G++ bug which was fixed by increasing using alignas() in the std::atomic<> implementation for objects small enough to be lock_free.

(And of course there are platforms where an 8-byte store isn't naturally atomic so to avoid tearing you need a fallback to a lock. If you care about such platforms, a publish-occasionally model should use a hand-rolled SeqLock or atomic<float> if atomic<double> isn't always_lock_free.)

You can get the same efficient code-gen (without extra barrier instructions) from atomic<T> using mo_relaxed as you can with volatile. Unfortunately in practice, not all compilers have efficient atomic<double>. For example, GCC9 for x86-64 copies from XMM to general-purpose integer registers.

#include <atomic>

volatile double vx;
std::atomic<double> ax;
double px; // plain x

void FP_non_RMW_increment() {
    px += 1.0;
    vx += 1.0;     // equivalent to vx = vx + 1.0
    ax.store( ax.load(std::memory_order_relaxed) + 1.0, std::memory_order_relaxed);
}

#if __cplusplus > 201703L    // is there a number for C++2a yet?
// C++20 only, not yet supported by libstdc++ or libc++
void atomic_RMW_increment() {
    ax += 1.0;           // seq_cst
    ax.fetch_add(1.0, std::memory_order_relaxed);   
}
#endif

Godbolt GCC9 for x86-64, gcc -O3. (Also included an integer version)

FP_non_RMW_increment():
        movsd   xmm0, QWORD PTR .LC0[rip]   # xmm0 = double 1.0 

        movsd   xmm1, QWORD PTR px[rip]        # load
        addsd   xmm1, xmm0                     # plain x += 1.0
        movsd   QWORD PTR px[rip], xmm1        # store

        movsd   xmm1, QWORD PTR vx[rip]
        addsd   xmm1, xmm0                     # volatile x += 1.0
        movsd   QWORD PTR vx[rip], xmm1

        mov     rax, QWORD PTR ax[rip]      # integer load
        movq    xmm2, rax                   # copy to FP register
        addsd   xmm0, xmm2                     # atomic x += 1.0
        movq    rax, xmm0                   # copy back to integer
        mov     QWORD PTR ax[rip], rax      # store

        ret

clang compiles it efficiently, with the same move-scalar-double load and store for ax as for vx and px.

Fun fact: C++20 apparently deprecates vx += 1.0. Perhaps this is to help avoid confusion between separate load and store like vx = vx + 1.0 vs. atomic RMW? To make it clear there are 2 separate volatile accesses in that statement?

<source>: In function 'void FP_non_RMW_increment()':
<source>:9:8: warning: compound assignment with 'volatile'-qualified left operand is deprecated [-Wvolatile]
    9 |     vx += 1.0;     // equivalent to vx = vx + 1.0
      |     ~~~^~~~~~

Note that x = x + 1 is not the same thing as x += 1 for atomic<T> x: the former loads into a temporary, adds, then stores. (With sequential-consistency for both).

回答3:

The only purpose I can imagine is when I have a thread that changes an atomic double or float asynchronously at random points and other threads read this values asynchronously

Yes, this is the only purpose of an atomic regardless of the actual type. may it be an atomic bool, char, int, long or whatever.

Whatever usage you have for type, std::atomic<type> is a thread-safe version of it. Whatever usage you have for a float or a double, std::atomic<float/double> can be written, read or compared with a thread-safe manner.

saying that std::atomic<float/double> has only rare usages is practically saying that float/double have rare usages.

来源：https://stackoverflow.com/questions/58680928/c20-stdatomicfloat-stdatomicdouble-specializations

标签

c++