Question
I was reading a very interesting blog post by Dan Luu about advances in x86 architecture over the past few decades, in which he says:
If we set _foo to 0 and have two threads that both execute incl (_foo) 10000 times each, incrementing the same location with a single instruction 20000 times, [the final value] is guaranteed not to exceed 20000, but it could (theoretically) be as low as 2. If it's not obvious why the theoretical minimum is 2 and not 10000, figuring that out is a good exercise.

where _foo is some memory address.
Obviously this is because (as he says farther down) incl is implemented as a load followed by an add followed by a store. So if you "desugar" it into:
mov reg, _foo ;; #1
inc reg ;; #2
mov _foo, reg ;; #3
Then the following ordering of uops results in _foo = 2:
Thread A executes #1, #2
Thread B executes #1, #2
Thread A executes #3
Thread B executes #3
Thread A executes #1, #2.... etc
(I may be muddling the details of assembler here a little bit, but as far as I know this is a reasonably accurate description of the case where _foo = 2.)
What I wonder about is his next "exercise":
[M]y bonus exercise for you is, can any reasonable CPU implementation get that result, or is that some silly thing the spec allows that will never happen? There isn’t enough information in this post to answer the bonus question...
Can it? My instinct is no, because I believe that when A executes #3, then either:
A and B are on the same CPU. B won't get to run until A's timeslice is up, and there's no way that it will take a whole timeslice to execute a single instruction, so eventually someone is going to write out a value > 2, or
A and B are on different CPUs. A's write causes B's cache to become invalidated and A gets to continue executing, writing out a value > 2.
But I'm not positive if every store causes every other cache to get invalidated, or if A is able to continue running during that time, and I'm not sure if OS-level things like timeslices should apply to reasoning about CPUs.
Answer 1:
tl;dr summary: not possible on a single core with the one-instruction inc [foo]. Maybe possible with each thread on its own core, but I think only with hyperthreading to create extra delays on stores by causing cache evictions between the load/inc and the store.
I think not even multi-socket cache coherency can be slow enough for B's final store to be delayed 50k cycles after B's final load, but hyperthreading might be able to queue multiple cache/TLB misses ahead of it.
In the single-core case: your assumption that B won't get to run until A's timeslice is up doesn't necessarily hold. An interrupt (e.g. a timer interrupt or NIC) can come in at any point, suspending execution of a user-space thread at any instruction boundary. Perhaps after the interrupt, a higher-priority process wakes up and is scheduled onto the CPU for a while, so there's no reason for the scheduler to prefer the thread that had already run for a fraction of a timeslice.
However, if we're just talking about the single-core case, and concurrency can only happen via context switches, inc [mem] is very different from mov reg, [mem] / inc reg / mov [mem], reg. Regardless of how the CPU internals handle inc [mem], a context switch only saves/restores the architectural state. If the load and inc part had already internally completed, but not the store, the whole instruction couldn't have retired. A context switch wouldn't save/restore that progress: the load and inc would have to be re-run when the thread started executing again and the CPU saw the inc [mem] instruction again.
If the test had used separate load/inc/store instructions, even a single-core machine could in theory get 2 by the sequence Michael Burr points out:
A loads 0 from _foo
B loops 9999 times (finally storing _foo = 9999)
A stores _foo = 1 (end of first iteration)
B's final iteration loads 1 from _foo
A loops 9999 times (eventually storing _foo = 10000)
B's final iteration stores _foo = 2
This is possible, but would require several context switches triggered by interrupts arriving at extremely specific times. It takes many cycles from an interrupt that causes the scheduler to preempt a thread to the point where the first instruction from a new thread actually runs. There's probably enough time for another interrupt to arrive. We're just interested in it being possible, not likely enough to be observable even after days of trials!
Again, with inc [mem], this is impossible on a single core, because context switches can only happen after whole instructions. The CPU's architectural state has either executed the inc or not.
In a multicore situation, with both threads running at the same time, things are entirely different. Cache coherency operations can happen between the uops that a single instruction is decoded into. So inc [mem] isn't a single operation in this context.
I'm not sure about this, but I think it might be possible even for a single-instruction inc [foo] loop to produce a final result of 2. Interrupts / context switches can't account for delays between load and store, though, so we need to come up with other possible reasons.
- A loads 0 from foo.
- B loops 9999 times (finally storing foo = 9999). The cache line is now in the E state on B's CPU core.
- A stores _foo = 1 (end of first iteration). This could conceivably be delayed this long on a hyperthreaded CPU by the other logical thread saturating the store port with a big backlog of stores which miss in cache and/or TLB, and by the store being buffered for a while. Possibly this can happen without hyperthreading, with several cache-miss stores waiting to complete. Remember that stores become globally visible in program order in x86's strong memory model, so a later store to a hot cache line still has to wait. Having it complete just in time for B's last iteration is just a coincidence in timing, which is fine.
- B's final iteration loads 1 from foo. (An interrupt or something could delay B's execution of the final iteration. This doesn't require anything to happen between the load/inc/store uops of a single instruction, so I don't need to figure out whether receiving a coherency message (from A) that invalidates the cache line will prevent store-forwarding from forwarding the 9999 value from the previous iteration's store to this iteration's load. I'm not sure, but I think it could.)
- A loops 9999 times (eventually storing _foo = 10000).
- B's final iteration stores _foo = 2. Explaining how this store can be delayed until after A's loop completes seems like the biggest stretch. Hyperthreading could do it: the other logical core could evict the TLB entry for _foo, and maybe also evict the L1 D$ line containing the value. This eviction could happen between the load and the store uops of the final inc instruction. I'm not sure how long it can take for the coherency protocol to obtain write access to a cache line that's currently owned by another core. I'm sure it's usually far less than 50k cycles, actually less than a main memory access on CPUs with large inclusive last-level caches that act as a backstop for coherency traffic (e.g. Intel's Nehalem and later designs). Very-many-core systems with multiple sockets are potentially slow, but I think they still use a ring bus for coherency traffic.

I'm not sure it's plausible for B's final store to be delayed 50k cycles without hyperthreading to pile up some store-port contention and cause cache evictions. The load (which has to see A's store of 1, but not any of A's other stores) can't get too far ahead of the store in the OOO scheduler, since it still has to come after the store from the penultimate iteration. (A core must maintain in-order semantics within a single execution context.)
Since there's only a single memory location that's read and then written in both threads, there isn't any reordering of stores and loads. A load will always see previous stores from the same thread, so it can't become globally visible until after a store to the same location.
On x86, only StoreLoad reordering is possible, but in this case the only thing that matters is that the out-of-order machinery can delay the store for a long time, even without reordering it relative to any loads.
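(As an illustrative aside, the classic StoreLoad litmus test shows what x86 does allow. With x and y both initially 0 and each function below running on its own core, the outcome r1 == 0 && r2 == 0 is permitted: each load can complete before the other core's earlier store to a different location becomes globally visible. That kind of reordering can't help in the incl (_foo) case, because there the load and the store target the same address.)

volatile int x = 0, y = 0;   /* made-up names, just for illustration */
int r1, r2;

void thread1(void) { x = 1; r1 = y; }   /* store x, then load y */
void thread2(void) { y = 1; r2 = x; }   /* store y, then load x */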
The original blog post you're referring to looks good in general, but I did notice at least one mistake. There are a lot of good links in there.
it turns out that on modern x86 CPUs, using locking to implement concurrency primitives is often cheaper than using memory barriers
That link just shows that using lock add [mem], 0 as a barrier is cheaper on Nehalem, and esp. that it interleaves better with other instructions. It has nothing to say about using locking vs. lockless algorithms that depend on barriers. If you want to atomically increment a memory location, then the simplest choice by far is a locked instruction. Using just MFENCE would require some kind of separate lock implemented without atomic RMW operations, if that's possible.
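As an illustrative sketch (using C11 atomics; counter is a stand-in name): GCC and Clang normally compile an atomic increment like this to a single lock-prefixed RMW on x86, e.g. lock addl $1, counter(%rip), or lock xadd if the old value is needed.

#include <stdatomic.h>

static atomic_int counter;   /* illustrative shared counter */

void safe_increment(void)
{
    /* One atomic read-modify-write: no other core can interleave
       between the load, the add, and the store. */
    atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst);
}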
Clearly he wanted to introduce the topic of lock inc [mem] vs. inc [mem], and just wasn't careful about the wording. Elsewhere in the post his generalizations hold up better.
The example code is also weird, and compiling with -O0 makes quite nasty code as always. I fixed the inline asm to ask the compiler for a memory operand, rather than manually writing incl (reg), so with optimization on, it produces incl counter(%rip) instead of loading the pointer into a register. More importantly, -O3 also avoids keeping the loop counter in memory, even with the original source. -O3 on the original source still appears to produce correct code, even though it doesn't tell the compiler that it writes to memory.
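The fixed inline asm isn't shown here; as a rough sketch (assuming GNU extended asm and a stand-in variable name), asking the compiler for a memory operand looks something like this, which is what lets it emit incl counter(%rip) directly under optimization:

static int counter;   /* hypothetical shared counter */

static void nonatomic_increment(void)
{
    /* "+m" gives the asm a read/write memory operand instead of a
       pointer in a register, so the compiler can address counter
       directly (e.g. incl counter(%rip)). */
    __asm__ volatile ("incl %0" : "+m" (counter));
}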
Anyway, flawed as it is, I think the experiment is still valid, and it's unlikely that the huge loop overhead of compiling with -O0 added an artificial limit to the range of values the final counter could end up with.
Dan Luu's example asm syntax is a weird mix of Intel and AT&T syntax: mov [_foo], %eax is a load. It should be written mov eax, [_foo], or mov _foo, %eax, or maybe mov (_foo), %eax if you're trying to make it clear that it's a load rather than a mov-immediate. Anyway, I think it would be confusing if I didn't already know what he meant and was trying to demonstrate.
Answer 2:
A executes #1, #2
B executes #1, #2, #3 9999 times (_foo == 9999)
A executes #3 (_foo == 1)
B executes #1, #2 (part of iteration 10000, and reg == 2)
A executes #1, #2, #3 9999 times (completing its total of 10000 iterations)
B executes #3 (writing 2 to _foo)
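To sanity-check the arithmetic of that schedule, here is an illustrative single-threaded C sketch (names and structure are made up) that replays the six steps above using the desugared load/inc/store operations; it finishes with _foo = 2:

#include <stdio.h>

#define ITERS 10000

int main(void)
{
    int foo = 0;          /* the shared memory location _foo */
    int reg_a, reg_b;     /* each thread's private register copy */

    /* A executes #1, #2 (loads 0, increments its register to 1) */
    reg_a = foo; reg_a++;

    /* B executes #1, #2, #3 for 9999 full iterations: foo == 9999 */
    for (int i = 0; i < ITERS - 1; i++) {
        reg_b = foo; reg_b++; foo = reg_b;
    }

    /* A executes #3, clobbering B's work: foo == 1 */
    foo = reg_a;

    /* B executes #1, #2 of its final iteration: reg_b == 2 */
    reg_b = foo; reg_b++;

    /* A executes #1, #2, #3 for its remaining 9999 iterations: foo == 10000 */
    for (int i = 0; i < ITERS - 1; i++) {
        reg_a = foo; reg_a++; foo = reg_a;
    }

    /* B executes #3 of its final iteration, storing its stale register */
    foo = reg_b;

    printf("final _foo = %d\n", foo);   /* prints: final _foo = 2 */
    return 0;
}

The point is only that the schedule is self-consistent: B's final store writes back a register value that was loaded before A's remaining 9999 iterations ran.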
Source: https://stackoverflow.com/questions/34716388/can-any-reasonable-cpu-implementation-give-foo-2-in-this-case