Are x86 atomic RMW instructions wait-free?

Submitted by 社会主义新天地 on 2020-07-21 03:42:32

Question


On x86, atomic RMW instructions like lock add dword [rdi], 1 are implemented using cache locking on modern CPUs, so the cache line is locked for the duration of the instruction. The core brings the line into the MESI Exclusive/Modified state when the value is read, and it will not respond to MESI requests from other CPUs until the instruction has finished.
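As a minimal sketch in C++ (assuming GCC or Clang targeting x86-64), a relaxed fetch_add on a shared counter typically compiles to exactly such a lock-prefixed instruction:

    #include <atomic>

    std::atomic<int> counter{0};

    void increment() {
        // On x86-64, GCC and Clang typically compile this (when the
        // result is unused) to a single cache-locked RMW instruction:
        //   lock add dword ptr [rip + counter], 1
        counter.fetch_add(1, std::memory_order_relaxed);
    }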

There are two flavors of concurrent progress conditions: blocking and non-blocking. Atomic RMW instructions are non-blocking. The CPU hardware will never sleep or do something else while holding a cache lock (an interrupt happens before or after an atomic RMW, not during), so there is a finite (and small) upper bound on the number of steps before the cache line is released.

Non-blocking algorithms can be split into three flavors in theoretical computer science (the first two are contrasted in the sketch after this list):

  1. wait-free: every thread makes progress in a finite number of steps.

  2. lock-free: at least one thread makes progress in a finite number of steps.

  3. obstruction-free: in the absence of contention, a thread makes progress in a finite number of steps.
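A hedged C++ sketch of the difference between the first two flavors (the function names here are my own): a single hardware fetch_add is the wait-free pattern, while a compare-and-swap retry loop is the classic lock-free-but-not-wait-free pattern, because an individual thread can fail its CAS indefinitely even though each failure implies another thread succeeded.

    #include <atomic>

    std::atomic<int> x{0};

    // Wait-free: one hardware RMW; the thread completes in a bounded
    // number of steps regardless of how many other threads contend.
    int wait_free_increment() {
        return x.fetch_add(1);
    }

    // Lock-free but not wait-free: this thread's CAS may fail and
    // retry without bound, but every failure means some other
    // thread's CAS succeeded, so the system as a whole progresses.
    int lock_free_increment() {
        int old = x.load();
        while (!x.compare_exchange_weak(old, old + 1)) {
            // On failure, 'old' is refreshed with the current value; retry.
        }
        return old;  // value before our successful increment
    }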

What kind of guarantee does x86 provide?

I guess it is at least lock-free: if there is contention, at least one CPU will make progress.

But is x86 wait-free for atomic instructions? Is every CPU guaranteed to make progress in a finite number of steps, or could one or more CPUs be starved and potentially delayed indefinitely?

So what happens when there are multiple cores doing atomic operations on the same cache line?


Answer 1:


When multiple threads happen to lock the same cache line, their execution is serialized. This is write contention (called false sharing when the threads are actually updating different variables that merely share the line).

The single-writer principle stems from this: unlike reads, writes to a line cannot be performed concurrently.

From 1024cores.net:

atomic RMW operations have some fixed associated costs. For modern Intel x86 processors cost of a single atomic RMW operation (LOCK prefixed instruction) is some 40 cycles (depends on a particular model, and steadily decreases). [...] However, the cost is fixed [...]

From Intel Community:

In some architectures, operations that are not chosen to go first will be stalled (then retried by the hardware until they succeed), while in other architectures they will "fail" (for software-based retry). In an Intel processor, for example, a locked ADD instruction will be retried by the hardware if the target memory location is busy, while a locked "compare and exchange" operation must be checked to see if it succeeded (so the software must notice the failure and retry the operation).
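To make the software-retry flavor concrete, here is a sketch (my own illustration, not from the quoted text) of an operation x86 has no single instruction for, so it must be built from a locked compare-and-exchange plus a software retry loop:

    #include <atomic>

    // An atomic maximum: x86 has no single "lock max" instruction, so
    // software must loop on lock cmpxchg, noticing each failure and
    // retrying, exactly the second case described above.
    int atomic_fetch_max(std::atomic<int>& a, int v) {
        int old = a.load(std::memory_order_relaxed);
        while (old < v && !a.compare_exchange_weak(old, v)) {
            // cmpxchg failed: another thread changed the value;
            // 'old' now holds the fresh value, so re-check and retry.
        }
        return old;
    }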

The upper bound on the time it takes to, say, lock xadd a memory location (or several memory locations on the same cache line) is proportional to how much contention the cache line experiences.

Since the instruction itself is continuously retried by the hardware, eventually all of them will succeed.
So yes, every CPU is guaranteed to make progress in a finite number of steps, and the instruction "algorithm" as a whole (the instruction plus the hardware's retrying to lock the cache line) is wait-free on x86.

The execution time of the instruction itself does not depend on the number of threads contending on the cache line. Therefore atomic read-modify-write instructions themselves on x86 are wait-free population-oblivious.

By the same logic, the x86 store "algorithm" is wait-free, and the x86 store and load instructions are wait-free population-oblivious.

While, as someone has suggested, a microcode bug could cause the lock to stay held forever, we do not consider external factors when describing the flavor of an algorithm, only the logic itself.


Cache line lock acquisition is not fair.

The probability that a thread is selected to acquire the lock is proportional to how close it is to the thread that released it. Threads on the same core are more likely to acquire the lock than threads sharing the L2 cache, which in turn are more likely than threads sharing the L3 cache. Beyond that, threads on shorter QPI/UPI/NUMA-node paths have an edge over the others, and so on.

This holds true for software locks (spin locks) too, since a release store propagates the same way.
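As an analogy, a minimal test-and-set spin lock in C++ (a sketch of my own, not code from the answer) shows where that release store lives; which spinning thread observes it first, and so wins the next exchange, depends on cache topology, which is exactly the unfairness described above.

    #include <atomic>

    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            // Spin until the previous holder's release store reaches us
            // and our exchange wins the cache line.
            while (locked.exchange(true, std::memory_order_acquire)) {
            }
        }
        void unlock() {
            locked.store(false, std::memory_order_release);
        }
    };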


I ran a benchmark on an Intel i7 8700 (6c/12t) that confirms all of the above.
When continuously lock xadding the same memory location...

  • for 10 seconds, out of 5 threads running on different cores, the fastest thread performed 2.5 times as many lock xadds as the slowest one; out of 10 threads running as two-way hyper-threads, 3 times as many
  • 300 million times, progressively smaller numbers of lock xadds take progressively longer on average, up to 1.1 ms for 5 threads running on different cores and up to 193 ms for 10 threads running as two-way hyper-threads

and variance across runs of different processes is high.
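The original benchmark code was not posted; a minimal reconstruction under my own assumptions (thread count, duration, and relaxed ordering are arbitrary choices) would look like this:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        constexpr int kThreads = 5;
        std::atomic<long long> shared{0};
        std::atomic<bool> stop{false};
        std::vector<long long> counts(kThreads, 0);
        std::vector<std::thread> threads;

        for (int i = 0; i < kThreads; ++i) {
            threads.emplace_back([&, i] {
                long long n = 0;
                while (!stop.load(std::memory_order_relaxed)) {
                    // A lock-prefixed RMW on the shared cache line.
                    shared.fetch_add(1, std::memory_order_relaxed);
                    ++n;
                }
                counts[i] = n;  // per-thread tally of completed RMWs
            });
        }
        std::this_thread::sleep_for(std::chrono::seconds(10));
        stop.store(true, std::memory_order_relaxed);
        for (auto& t : threads) t.join();
        // Skewed per-thread counts reveal the unfair arbitration.
        for (int i = 0; i < kThreads; ++i)
            std::printf("thread %d: %lld increments\n", i, counts[i]);
    }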



Source: https://stackoverflow.com/questions/61744469/are-x86-atomic-rmw-instructions-wait-free
