Question
I'm reading the [`lock cmpxchg` description](https://www.felixcloutier.com/x86/CMPXCHG.html):
> This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor's bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)
Now consider two threads executing `lock cmpxchg`:
```asm
; Thread 1                        ; Thread 2
mov ebx, 0x4000                   mov ebx, 0x4000             ; address
mov edx, 0x62ab6                  mov edx, 0x62ab8            ; new val
mov eax, 0x62ab1                  mov eax, 0x62ab1            ; old
lock cmpxchg [ebx], eax           lock cmpxchg [ebx], eax     ; <----- here
```
The question is: can the `lock`ed `cmpxchg` in both Thread 1 and Thread 2 fail?
Since "the destination operand receives a write cycle without regard to the result of the comparison", I could guess that both threads get their write cycle and then both of them are reverted because the comparison was against a stale value... but I'm not sure if this is correct.
Maybe I need to look at the CAS implementation details, but they are not specified in the Intel instruction reference (at least I could not find them).
Answer 1:
My understanding is that `lock cmpxchg` cannot fail spuriously (unlike LL/SC), assuming the value at the memory address indeed matches. It builds those guarantees on the cache coherency protocol by taking exclusive ownership of the cache line and not yielding it to other cores until the operation is done.
So CAS can only fail for all threads if some other thread wrote to the memory location.
Answer 2:
@the8472's answer is correct, but I wanted to add an alternate answer.
https://www.felixcloutier.com/x86/CMPXCHG.html already specifies the behaviour in enough detail to rule out the possibility of spurious failure. If it could fail for some reason other than the value in memory not matching `eax`, the docs would have to say so.
You can also note the fact that compilers use a single `lock cmpxchg` for C++11 `std::atomic::compare_exchange_strong`, from which you can conclude that compiler writers think `lock cmpxchg` can't spuriously fail.
```cpp
#include <atomic>

bool cas_bool(std::atomic_int *a, int expect, int want) {
    return a->compare_exchange_strong(expect, want);
}
```
compiles to (gcc7.3 -O3):
```asm
cas_bool(std::atomic<int>*, int, int):
        mov     eax, esi
        lock cmpxchg    DWORD PTR [rdi], edx
        sete    al
        ret
```
See also Can num++ be atomic for 'int num'? for more details of how `lock`ed instructions are implemented internally, and how they interact with MESI. (i.e. @the8472's answer is the short version: for an operand that doesn't cross a cache-line boundary, a core just hangs onto that cache line so nothing else in the system can read or write it for the duration of the `lock cmpxchg`.)
> the destination operand receives a write cycle without regard to the result of the comparison
The read + write pair is atomic with respect to all other observers in the system. The ordering you propose (read1 / read2 / write1 / abort write2) is impossible because `lock cmpxchg` is atomic, so read2 can't appear between read1 and write1 in the global order.
Also, that language only applies to the external memory bus. Modern CPUs with integrated memory controllers can do whatever they want internally (even for a `lock cmpxchg` on an address that's split across two cache lines). Intel may publish documentation for motherboard vendors to use in their internal testing of signals on the memory bus.
That documentation might still be relevant for `lock cmpxchg` on an MMIO address, but definitely not for an aligned operand in write-back memory. In that case, it's just a cache lock. (And whether the L1d cache is actually written when the compare fails is a hidden implementation detail. I guess you could test this by seeing whether the operation dirties the cache line, i.e. puts it in Modified state instead of Exclusive.)
For more discussion about how `lock cmpxchg` might work internally vs. `xchg`, see the chat thread between me and @BeeOnRope following my answer on Exit critical region (mostly me having ideas that could work in theory but are incompatible with what we know about Intel x86 CPUs, and @BeeOnRope pointing out my mistakes): https://chat.stackoverflow.com/transcript/message/42472667#42472667. There's very little we can conclude for sure about the fine details of the efficiency of `xchg` vs. `lock cmpxchg`. It's certainly possible that `xchg` keeps the cache line locked for fewer cycles than `lock cmpxchg`, but that needs to be tested. I think `xchg` has better latency if used back-to-back on the same location from a single thread, though.
Source: https://stackoverflow.com/questions/50268929/can-cas-fail-for-all-threads