An implementation of std::atomic_thread_fence(std::memory_order_seq_cst) on x86 without extra performance penalties

Posted by 强颜欢笑 on 2020-06-16 19:07:14

Question


A follow-up question to "Why does this `std::atomic_thread_fence` work".

Since a dummy interlocked operation is better than _mm_mfence, and there are quite a few ways to implement one, which interlocked operation should be used, and on what data?

Assume the fence is written as inline assembly that is not aware of the surrounding context but can tell the compiler which registers it clobbers.


Answer 1:


Short answer for now, without going into too much detail about why. See specifically the discussion in comments on that linked question.

lock orb $0, -1(%rsp) is probably a good bet to avoid lengthening dependency chains for local vars that get spilled/reloaded. See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for benchmarks. On Windows x64 (no red zone), that space should be unused except by future call or push instructions.

Store forwarding to the load side of a locked operation might be a thing (if that space was recently used), so keeping the locked operation narrow is good. But since it is a full barrier, I don't expect there can be any store forwarding from its output to anything else, so unlike a normal narrow store, a narrow (1-byte) lock orb doesn't have that downside.

mfence is pretty crap compared to a hot line of stack space even on Haswell, probably worse on Skylake where it even blocks OoO exec. (And also bad on AMD compared to lock add).
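
As a rough illustration of the suggestion above, here is a minimal sketch of such a fence using GNU-style inline assembly (GCC or Clang, x86-64, AT&T syntax). The names seq_cst_fence_lock_or and fence_lock_or_demo are made up for this example, and the fallback to std::atomic_thread_fence on other targets is an assumption added only so the snippet builds everywhere.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper (the name is not from the answer): a full barrier
// implemented as a dummy locked OR on the byte just below the stack pointer.
inline void seq_cst_fence_lock_or() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // OR with 0 leaves the byte unchanged, so this is harmless even if the
    // byte happens to be live red-zone data. The "memory" clobber stops the
    // compiler from reordering memory accesses across the asm statement;
    // "cc" is listed because OR writes FLAGS.
    asm volatile("lock orb $0, -1(%%rsp)" ::: "memory", "cc");
#else
    // Assumption: on any other target, defer to the standard fence.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Usage sketch: a live local survives the fence unchanged.
inline int fence_lock_or_demo() {
    volatile int local = 42;
    seq_cst_fence_lock_or();
    assert(local == 42);
    return local;
}
```

The asm statement takes no operands on purpose: it matches the question's premise of a fence that knows nothing about the surrounding code beyond what it clobbers.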




Answer 2:


When going the route of an interlocked operation on a dummy location, there are a few things to consider. The location should:

  1. Be in L1d of this core
  2. Not be used by other cores
  3. Not create long dependency chains
  4. Not cause stalls due to store-forwarding misses

Without the context, anything is only a guess, so the goal is to make the best guess.

A place near the top of the stack is a good guess for 1 and 2.

A deliberately allocated stack variable is likely to fix 3, and as there are no other stores in flight, 4 is not a problem. In that case the best operation looks like lock not.
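
The deliberately allocated variant could be sketched roughly as follows, again assuming GCC or Clang on x86-64. The function names are hypothetical, and the choice of an output-only "=m" constraint is an assumption of this sketch rather than something the answer specifies.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: full barrier via lock not on a dedicated stack byte.
inline void seq_cst_fence_lock_not() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    unsigned char dummy;  // deliberately allocated; never initialized or read in C++
    // The byte holds garbage that nobody uses, so flipping its bits is harmless.
    // It is passed as an output-only "=m" operand (an assumption of this sketch)
    // so the compiler emits no initializing store that the locked load would
    // have to wait for. NOT does not modify FLAGS, so no "cc" clobber is needed.
    asm volatile("lock notb %0" : "=m"(dummy) : : "memory");
#else
    // Assumption: on any other target, defer to the standard fence.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Usage sketch: other locals are unaffected by the dummy byte being flipped.
inline int fence_lock_not_demo() {
    volatile int local = 7;
    seq_cst_fence_lock_not();
    assert(local == 7);
    return local;
}
```

Because the compiler picks the address of the dummy byte, this version does not depend on whether the ABI has a red zone.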

Not allocating a stack variable requires the operation to be effectively a no-op, so lock or [mem], 0 is a good option. The operand should be a byte to avoid problems with 4. For 3, it is always a guess. (The return address could have been used, but assembly that is unaware of its context does not know where it is. MSVC's _AddressOfReturnAddress may be a good idea, though.)

I've read about the red zone. Its absence on Windows enables extra optimizations.

lock not byte ptr [esp-1] without an extra variable is good on Windows, since the data there is considered volatile and should not be in use. There are no spilled registers there, so there is no false data dependency.

An ABI with a 128-byte red zone precludes the use of lock not byte ptr [esp-1]. Memory 128 bytes beyond the stack pointer is far enough away that it is likely not in L1d. Still, since the red zone is less likely to be in use than the regular stack, the answer given by @Peter Cordes looks good.

TSX is questionable primarily because it may be absent (unsupported on a given CPU, or disabled by an erratum fix or a security mitigation). Only RTM will exist in the foreseeable future (see "Has Hardware Lock Elision gone forever due to Spectre Mitigation?"). According to the RTM overview, an empty RTM transaction is still a fence, so it can be used.

A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK prefixed instruction.

Beware of failed transactions and unsupported RTM. The pseudocode seems to be as follows:

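// rtm_supported and dummy_interlocked_op() are placeholders
if (rtm_supported && _xbegin() == _XBEGIN_STARTED)  // _XBEGIN_STARTED is 0xFFFFFFFF
  _xend();                 // committed empty transaction acts as a fence
else
  dummy_interlocked_op();  // RTM unavailable or the transaction aborted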
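
A somewhat fuller version of the pseudocode above might look like the sketch below. It assumes GCC or Clang on x86-64, detects RTM through CPUID leaf 7 (EBX bit 11), and falls back to a locked OR on the stack. The function names, the cached static flag, and the surrounding std::atomic_signal_fence compiler barriers are choices made for this example, not part of the original answer.

```cpp
#include <atomic>
#include <cassert>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>      // __get_cpuid_count
#include <immintrin.h>  // _xbegin, _xend, _XBEGIN_STARTED

// RTM support is reported in CPUID.(EAX=7,ECX=0):EBX bit 11.
inline bool cpu_has_rtm() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    return ((ebx >> 11) & 1u) != 0;
}

// target("rtm") lets the intrinsics compile without passing -mrtm globally.
__attribute__((target("rtm")))
inline void seq_cst_fence_rtm() {
    static const bool rtm_supported = cpu_has_rtm();  // queried once, then cached
    std::atomic_signal_fence(std::memory_order_seq_cst);  // compiler-only barrier
    if (rtm_supported && _xbegin() == _XBEGIN_STARTED) {
        _xend();  // committed empty transaction: ordered like a LOCK-prefixed instruction
    } else {
        // RTM is missing or the transaction aborted: dummy interlocked operation.
        asm volatile("lock orb $0, -1(%%rsp)" ::: "memory", "cc");
    }
    std::atomic_signal_fence(std::memory_order_seq_cst);  // compiler-only barrier
}
#else
// Assumption: on any other target, defer to the standard fence.
inline void seq_cst_fence_rtm() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
#endif

// Usage sketch: the fence works whether or not the CPU supports RTM.
inline int fence_rtm_demo() {
    volatile int local = 5;
    seq_cst_fence_rtm();
    assert(local == 5);
    return local;
}
```

Whether the transactional path is actually cheaper than the plain locked operation is not something this sketch measures; it only shows how the two paths fit together.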


Source: https://stackoverflow.com/questions/62337376/an-implementation-of-stdatomic-thread-fencestdmemory-order-seq-cst-on-x86
