Can x86 reorder a narrow store with a wider load that fully contains it?

悲哀的现实 · 2020-11-28 12:43

Intel® 64 and IA-32 Architectures Software Developer’s Manual says:

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations

2 Answers

    有刺的猬 · 2020-11-28 13:24

    Can mov byte [rcx+r8], 1 reorder with the cmp qword [rcx], rdx load that follows it? This is the lock[threadNum]=1 store and the following load to make sure nobody else wrote a byte.

    The load must return data that includes the store, because the executing thread always observes its own actions to happen in program order. (This is true even on weakly-ordered ISAs).
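
    For concreteness, here is a minimal C++ sketch of the TryLock pattern being discussed (my reconstruction from the asm above, not the question's actual code; the byte/qword type-punning isn't well-defined ISO C++ and is only meant to mirror the machine-level sequence):

        #include <cstdint>

        // One byte per thread, plus an 8-byte view of the whole array.
        // The names lock and threadNum follow the answer; everything else is assumed.
        union Slots {
            volatile uint8_t  byte[8];
            volatile uint64_t all;
        };
        alignas(8) Slots lock = {};

        bool TryLock(unsigned threadNum) {
            lock.byte[threadNum] = 1;            // mov byte [rcx+r8], 1  (narrow store)
            uint64_t seen = lock.all;            // qword reload that fully contains that byte
            // Succeed only if no other thread's byte is set.
            return seen == (uint64_t)1 << (threadNum * 8);
        }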


    It turns out this exact locking idea has been proposed before (for the Linux kernel), and Linus Torvalds explained that x86 really does allow this kind of reordering.

    Despite the term "store-forwarding failure or stall", it doesn't mean the data has to commit to cache before the load can read it. It actually can be read from the store buffer while the cache line is still in S state (MESI). (And on in-order Atom cores, you don't even get a store-forwarding stall at all.)

    Real hardware does work this way (as Alex's tests show): the CPU will merge data from L1D with data from the store buffer, without committing the store to L1D.

    This by itself isn't reordering yet¹ (the load sees the store's data, and they're adjacent in the global order), but it leaves the door open for reordering. The cache line can be invalidated by another core after the load, but before the store commits. A store from another core can become globally visible after our load, but before our store.

    So the load includes data from our own store, but not from the other store from another CPU. The other CPU can see the same effect for its load, and thus both threads enter the critical section.
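
    A small test harness (hypothetical, not from the answer) makes this concrete: rendezvous two threads each round, run TryLock in both, and count the rounds in which both of them succeed. The violating window is only a few dozen cycles wide, so how often you catch it depends on how tightly the threads race, but on real x86 the count can be nonzero:

        #include <atomic>
        #include <cstdint>
        #include <cstdio>
        #include <thread>

        union Slots { volatile uint8_t byte[8]; volatile uint64_t all; };
        alignas(8) Slots lk = {};                // repeated here so the harness is self-contained

        std::atomic<long> arrivals{0};           // shared rendezvous counter
        std::atomic<int>  winners{0};            // TryLock successes in the current round
        long both_won = 0;                       // rounds where mutual exclusion failed
        constexpr long kRounds = 1'000'000;

        bool TryLock(unsigned n) {
            lk.byte[n] = 1;                           // narrow byte store
            return lk.all == (uint64_t)1 << (n * 8);  // wide qword reload
        }

        void worker(unsigned n) {
            for (long i = 0; i < kRounds; ++i) {
                arrivals.fetch_add(1);                    // barrier A: race the TryLocks
                while (arrivals.load() < 4 * i + 2) {}
                if (TryLock(n)) winners.fetch_add(1);
                arrivals.fetch_add(1);                    // barrier B: wait for both results
                while (arrivals.load() < 4 * i + 4) {}
                if (n == 0) {                             // thread 0 records and resets
                    if (winners.load() == 2) ++both_won;
                    winners.store(0);
                    lk.all = 0;
                }
            }
        }

        int main() {
            std::thread a(worker, 0), b(worker, 1);
            a.join(); b.join();
            std::printf("both threads took the lock in %ld of %ld rounds\n", both_won, kRounds);
        }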


    ¹ (This is the point I was making in comments on Alex's answer. If x86 didn't allow this reordering, CPUs could still do the store-forwarding speculatively before the store becomes globally visible, and shoot it down if another CPU invalidated the cache line before the store committed. That part of Alex's answer didn't prove that x86 worked the way it does. Only experimental testing and careful reasoning about the locking algo gave us that.)

    If x86 did disallow this reordering, a store/partially-overlapping-reload pair would work like an MFENCE: Earlier loads can't become globally visible before the load, and earlier stores can't become globally visible before the store. The load has to become globally visible before any following loads or stores, and it would stop the store from being delayed, too.

    Given this reasoning, it's not totally obvious why perfectly-overlapping stores aren't equivalent to an MFENCE as well. Perhaps they actually are, and x86 only manages to make spill/reload or arg-passing on the stack fast with speculative execution!
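
    (To actually get that MFENCE-like behaviour you have to ask for it explicitly. A hedged sketch, my addition rather than anything from the question: put a full barrier between the byte store and the qword reload. With that barrier in both threads' TryLock, each reload is ordered after its own store's global visibility, so at least one of two racing callers must see the other's byte and fail.)

        #include <atomic>
        #include <cstdint>

        // Same type-punning caveats as before; the fence is the point here.
        union Slots { volatile uint8_t byte[8]; volatile uint64_t all; };
        alignas(8) Slots slots = {};

        bool TryLockFenced(unsigned threadNum) {
            slots.byte[threadNum] = 1;           // narrow store
            // Full StoreLoad barrier: compiles to mfence (or an equivalent
            // locked operation) on x86, so the reload below can't complete
            // until the store above is globally visible.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            return slots.all == (uint64_t)1 << (threadNum * 8);
        }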


    The locking scheme:

    It looks like TryLock can fail for both/all callers: They all see it initially zero, they all write their byte, then they all see at least two non-zero bytes each. This is not ideal for heavily-contended locks, compared to using a locked instruction. There is a hardware arbitration mechanism to handle conflicting locked insns. (TODO: find the Intel forum post where an Intel engineer posted this in response to another software retry loop vs. locked instruction topic, IIRC.)
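
    For comparison, a conventional try-lock built on a single locked RMW looks like this (my sketch, not part of the question or the answer). On x86 the exchange compiles to xchg, which has an implicit lock prefix, so when the lock is free the hardware arbitration guarantees exactly one racing caller reads the old value 0:

        #include <atomic>
        #include <cstdint>

        std::atomic<uint8_t> simple_lock{0};

        bool TryLockXchg() {
            // Atomically write 1 and return the previous value (xchg on x86).
            return simple_lock.exchange(1, std::memory_order_acquire) == 0;
        }

        void Unlock() {
            simple_lock.store(0, std::memory_order_release);
        }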

    The narrow-write / wide-read will always trigger a store-forwarding stall on modern x86 hardware. I think this just means the load result isn't ready for several cycles, not that execution of other instructions stalls (at least not in an OOO design).

    In a lightly-contended lock that's used frequently, the branch will be correctly predicted to take the no-conflict path. Speculative execution down that path, until the load finally completes and the branch can retire, shouldn't stall, because store-forwarding stalls aren't quite long enough to fill up the ROB.

    Store-forwarding-stall penalties by microarchitecture (mostly from Agner Fog's tables):

    • SnB: ~12 cycles longer than when store-forwarding works (~5c)
    • HSW: ~10c longer
    • SKL: ~11c longer than when store-forwarding works (4c for 32- and 64-bit operands, 1c less than previous CPUs)
    • AMD K8/K10: Agner Fog doesn't give a number.
    • AMD Bulldozer-family: 25-26c (Steamroller)
    • Atom: "Unlike most other processors, the Atom can do store forwarding even if the read operand is larger than the preceding write operand or differently aligned", and there is only 1c latency. It only fails when crossing a cache-line boundary.
    • Silvermont: ~5c extra (base: 7c)
    • AMD Bobcat/Jaguar: 4-11c extra (base: 8c/3c)
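
    If you want to measure the penalty on your own machine, here is a rough micro-benchmark sketch (my addition; the numbers it prints are only indicative and depend on the core, the compiler, and clock behaviour). It times a loop-carried chain of byte-store -> qword-reload, which takes the stall path, against a qword-store -> qword-reload chain, which store-forwarding handles:

        #include <chrono>
        #include <cstdint>
        #include <cstdio>

        union Word { volatile uint8_t byte[8]; volatile uint64_t all; };

        // Each iteration's reload feeds the next iteration's store, so the
        // time per iteration approximates the store -> wider-reload latency.
        template <bool kNarrowStore>
        double time_chain(long iters) {
            Word w = {};
            uint64_t v = 0;
            auto t0 = std::chrono::steady_clock::now();
            for (long i = 0; i < iters; ++i) {
                if constexpr (kNarrowStore)
                    w.byte[0] = (uint8_t)v;   // 1-byte store: the reload below can't be forwarded
                else
                    w.all = v;                // 8-byte store: the reload below forwards normally
                v = w.all + 1;                // 8-byte reload, dependent on the store
            }
            auto t1 = std::chrono::steady_clock::now();
            return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
        }

        int main() {
            const long iters = 100'000'000;
            std::printf("byte store  -> qword reload: %.2f ns/iter\n", time_chain<true>(iters));
            std::printf("qword store -> qword reload: %.2f ns/iter\n", time_chain<false>(iters));
        }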

    So if the whole locking scheme works, it might do well for lightly-contended locks.

    I think you could turn it into a multiple-readers/single-writer lock by using bit 1 in each byte for readers and bit 2 for writers. TryLock_reader would ignore the reader bits in other bytes. TryLock_writer would work like the original, requiring a zero in all bits in other bytes.
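
    A hedged sketch of that variant (my reading of the idea above, not code from the answer), using bit 0 of each byte as the reader flag and bit 1 as the writer flag; it inherits the same narrow-store/wide-reload caveats as the original scheme:

        #include <cstdint>

        union Slots { volatile uint8_t byte[8]; volatile uint64_t all; };
        alignas(8) Slots rw = {};

        constexpr uint64_t kWriterBits = 0x0202020202020202ull;   // the writer bit of every byte

        bool TryLock_reader(unsigned n) {
            rw.byte[n] = 0x01;                                        // set our reader bit
            uint64_t others = rw.all & ~((uint64_t)0xFF << (n * 8));  // wide reload, ignoring our own byte
            return (others & kWriterBits) == 0;                       // readers only conflict with writers
        }

        bool TryLock_writer(unsigned n) {
            rw.byte[n] = 0x02;                                        // set our writer bit
            uint64_t others = rw.all & ~((uint64_t)0xFF << (n * 8));
            return others == 0;                                       // writers need every other byte fully zero
        }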


    BTW, for memory ordering stuff in general, Jeff Preshing's blog is excellent.
