Are memory orderings: consume, acq_rel and seq_cst ever needed on Intel x86?

问题

C++11 specifies six memory orderings:

typedef enum memory_order {
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
} memory_order;

https://en.cppreference.com/w/cpp/atomic/memory_order

where the default is seq_cst.

Performance gains can be found by relaxing the memory ordering of operations. However, this depends on what protections the architecture provides. For example, Intel x86 is a strong memory model and guarantees that various loads/store combinations will not be re-ordered.

As such relaxed, acquire and release seem to be the only orderings required when seeking additional performance on x86.

Is this correct? If not, is there ever a need to use consume, acq_rel and seq_cst on x86?

回答1:

If you care about portable performance, you should ideally write your C++ source with the minimum necessary ordering for each operation. The only thing that really costs "extra" on x86 is mo_seq_cst for a pure store, so make a point of avoiding that even for x86.

(relaxed ops can also allow more compile-time optimization of the surrounding non-atomic operations, e.g. CSE and dead store elimination, because relaxed ops avoid a compiler barrier. If you don't need any order wrt. surrounding code, tell the compiler that fact so it can optimize.)

Keep in mind that you can't fully test weaker orders if you only have x86 hardware, especially atomic RMWs with only acquire or release, so in practice it's safer to leave your RMWs as seq_cst if you're doing anything that's already complicated and hard to reason about correctness.

There are very few use-cases where seq_cst is required (draining the store buffer before later loads can happen). Almost always a weaker order like acquire or release would also be safe.

There are artificial cases like https://preshing.com/20120515/memory-reordering-caught-in-the-act/, but even implementing locking generally only requires acquire and release ordering. (Of course taking a lock does require an atomic RMW, so on x86 that might as well be seq_cst.) One practical use-case I came up with was to have multiple threads set bits in an array. Avoid atomic RMWs and detect when one thread stepped on another by re-checking values that were recently stored. You have to wait until your stores are globally visible before you can safely reload them to check.

As such relaxed, acquire and release seem to be the only orderings required on x86.

From one POV, in C++ source you don't require any ordering weaker than seq_cst (except for performance); that's why it's the default for all std::atomic functions. Remember you're writing C++, not x86 asm.

Or if you mean to describe the full range of what x86 asm can do, then it's acq for loads, rel for pure stores, and seq_cst for atomic RMWs. (The lock prefix is a full barrier; fetch_add(1, relaxed) compiles to the same asm as seq_cst). x86 asm can't do a relaxed load or store¹.

The only benefit to using relaxed in C++ (when compiling for x86) is to allow more optimization of surrounding non-atomic operations by reordering at compile time, e.g. to allow optimizations like store coalescing and dead-store elimination. Always remember that you're not writing x86 asm; the C++ memory model applies for compile-time ordering / optimization decisions.

acq_rel and seq_cst are nearly identical for atomic RMW operations in ISO C++, I think no difference when compiling for ISAs like x86 and ARMv8 that are multi-copy-atomic. (No IRIW reordering like e.g. POWER can do by store-forwarding between SMT threads before a store commits to L1d). How do memory_order_seq_cst and memory_order_acq_rel differ?

For barriers, atomic_thread_fence(mo_acq_rel) compiles to zero instructions on x86, while fence(seq_cst) compiles to mfence or a faster equivalent (e.g. a dummy locked instruction on some stack memory). When is a memory_order_seq_cst fence useful?

You could say acq_rel and consume are truly useless if you're only compiling for x86. consume was intended to expose the dependency ordering that most weakly-ordered ISAs do (notably not DEC Alpha). But unfortunately it was designed in a way that compilers couldn't implement safely so they currently just give up and promote it to acquire, which costs a barrier on some weakly-ordered ISAs. But on x86, acquire is "free" so it's fine.

If you actually do need efficient consume, e.g. for RCU, your only real option is to use relaxed and don't give the compiler enough information to optimize away the data dependency from the asm it makes. C++11: the difference between memory_order_relaxed and memory_order_consume.

Footnote 1: I'm not counting movnt as a relaxed atomic store because the usual C++ -> asm mapping for release operations uses just a mov store, not sfence, and thus would not order an NT store. i.e. std::atomic leaves it up to you to use _mm_sfence() if you'd been messing around with _mm_stream_ps() stores.

PS: this entire answer is assuming normal WB (write-back) cacheable memory regions. If you just use C++ normally under a mainstream OS, all your memory allocations will be WB, not weakly-ordered WC or strongly-ordered uncacheable UC or anything else. In fact even if you wanted a WC mapping of a page, most OSes don't have an API for that. And std::atomic release stores would be broken on WC memory, weakly-ordered like NT stores.

来源：https://stackoverflow.com/questions/61719680/are-memory-orderings-consume-acq-rel-and-seq-cst-ever-needed-on-intel-x86

标签

c++

performance

optimization

x86

stdatomic