An implementation of std::atomic_thread_fence(std::memory_order_seq_cst) on x86 without extra performance penalties
问题 A following-up question for Why does this `std::atomic_thread_fence` work As a dummy interlocked operation is better than _mm_mfence , and there are quite many ways to implement it, which interlocked operation and on what data should be used? Assume using an inline assembly that is not aware of surrounding context, but can tell the compiler which registers it clobbers. 回答1: Short answer for now, without going into too much detail about why. See specifically the discussion in comments on that