Question
For the purposes of this question, I assume a simple spinlock that does not fall back to OS waiting.
I see that a simple spinlock is often implemented using lock xchg or lock bts instead of lock cmpxchg.
But doesn't cmpxchg avoid writing the value if the expectation does not match? So aren't failed attempts cheaper with cmpxchg?
Or does cmpxchg write data and invalidate the cache line on other cores even on failure?
This question is similar to "What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?", but it is specific to cmpxchg, not about writes in general.
Answer 1:
I ran some tests. They were very synthetic: very little work was done under the lock, and I measured throughput in a heavily contended scenario.
So far, I have observed no consistent difference between lock bts, xchg, and lock cmpxchg.
Other things, however, did have some effect:
- An inner load loop definitely helps, both with and without pause
- One pause in the loop helps, both with and without the load loop
- The load loop helps more than pause
- The best results are achieved by applying the "Improved Version" from the Intel® 64 and IA-32 Architectures Optimization Reference Manual (see below)
- Starting with a load instead of an RMW/CAS has a mixed effect: it helps in tests without pause, but degrades performance in tests with pause
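For reference, here is a minimal C++ sketch of the acquire attempts compared above. The helper names and the LockWord alias are hypothetical, and the instruction mapping in the comments is only what x86 compilers typically emit, not a guarantee; check the generated assembly for your compiler.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration. Bit 0 set means "locked".
using LockWord = std::atomic<std::uint32_t>;

// Unconditional swap; on x86 this typically compiles to xchg.
inline bool try_acquire_xchg(LockWord& w) {
    return w.exchange(1, std::memory_order_acquire) == 0;
}

// Set bit 0 and test its old value; compilers may lower this to lock bts.
inline bool try_acquire_bts(LockWord& w) {
    return (w.fetch_or(1, std::memory_order_acquire) & 1u) == 0;
}

// Compare-and-swap 0 -> 1; on x86 this typically compiles to lock cmpxchg.
inline bool try_acquire_cas(LockWord& w) {
    std::uint32_t expected = 0;
    return w.compare_exchange_strong(expected, 1,
                                     std::memory_order_acquire,
                                     std::memory_order_relaxed);
}

// "Start with a load" variant: only attempt the RMW if the lock looks free.
inline bool try_acquire_load_first(LockWord& w) {
    return w.load(std::memory_order_relaxed) == 0 && try_acquire_xchg(w);
}

inline void release(LockWord& w) {
    // Debug check: releasing a lock that is not held is a caller bug.
    assert((w.load(std::memory_order_relaxed) & 1u) != 0);
    w.store(0, std::memory_order_release);
}
```

A spinlock built on any of these would retry the chosen try_acquire_* function in a loop, optionally with a load loop and pause between attempts.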
The Intel® 64 and IA-32 Architectures Optimization Reference Manual recommends using pause.
Its Example 2-4, "Contended Locks with Increasing Back-off Example", shows the baseline version:
/*******************/
/*Baseline Version */
/*******************/
// atomic {if (lock == free) then change lock state to busy}
while (cmpxchg(lock, free, busy) == fail)
{
    while (lock == busy)
    {
        __asm__ ("pause");
    }
}
and the improved version:
/*******************/
/*Improved Version */
/*******************/
int mask = 1;
int const max = 64; //MAX_BACKOFF
while (cmpxchg(lock, free, busy) == fail)
{
    while (lock == busy)
    {
        for (int i=mask; i; --i){
            __asm__ ("pause");
        }
        mask = mask < max ? mask<<1 : max;
    }
}
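The manual's listing is pseudocode (cmpxchg, free, busy and fail are not defined there). As a rough sketch of how the improved version might look in compilable C++, assuming std::atomic and the _mm_pause intrinsic on x86, something like the following could be used. The class name BackoffSpinLock, the cpu_relax helper, and the try_lock method are hypothetical additions for illustration, not part of the manual.

```cpp
#include <atomic>
#include <cassert>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
static inline void cpu_relax() { _mm_pause(); }  // emits the x86 pause instruction
#else
static inline void cpu_relax() {}                // assumption: no-op placeholder on non-x86
#endif

// Hypothetical sketch of the "Improved Version": CAS to acquire, then a
// load-only wait loop whose pause count doubles up to a cap.
class BackoffSpinLock {
    std::atomic<bool> busy_{false};
    static constexpr int kMaxBackoff = 64;  // mirrors MAX_BACKOFF in the manual's listing

public:
    // Single attempt, no spinning (not in the manual; added for completeness).
    bool try_lock() {
        return !busy_.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        int mask = 1;
        for (;;) {
            bool expected = false;  // reset each round: a failed CAS overwrites it
            if (busy_.compare_exchange_strong(expected, true,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
                return;
            }
            // Wait with plain loads so the line can stay shared between waiters.
            while (busy_.load(std::memory_order_relaxed)) {
                for (int i = mask; i; --i) {
                    cpu_relax();
                }
                mask = mask < kMaxBackoff ? mask << 1 : kMaxBackoff;
            }
        }
    }

    void unlock() {
        // Debug check: unlocking a lock that is not held is a caller bug.
        assert(busy_.load(std::memory_order_relaxed) && "unlock() on a free lock");
        busy_.store(false, std::memory_order_release);
    }
};
```

Typical use is lock(), a short critical section, then unlock(). The cap of 64 is taken from the listing above; a suitable value for a real workload would need measuring.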
Windows SRWLOCK may also be a good example to follow. It uses a load loop and pause. It starts with an interlocked operation: lock bts for exclusive acquire, lock cmpxchg for shared acquire. Even TryAcquireSRWLockExclusive does only a lock bts:
RtlTryAcquireSRWLockExclusive:
00007FFA86D71370 lock bts qword ptr [rcx],0
00007FFA86D71376 setae al
00007FFA86D71379 ret
It does not, however, implement an exponentially growing pause in the waiting versions. It does a small number of loads with one pause, then goes to an OS wait.
Source: https://stackoverflow.com/questions/63008857/does-cmpxchg-write-destination-cache-line-on-failure-if-not-is-it-better-than