Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?

Question

For the purposes of this question, I assume a simple spinlock that never falls back to an OS wait.

I see that a simple spinlock is often implemented using lock xchg or lock bts instead of lock cmpxchg.

But doesn't cmpxchg avoid writing the value if the expected value does not match? If so, aren't failed attempts cheaper with cmpxchg?

Or does cmpxchg write the data and invalidate the cache line in other cores even on failure?

This question is similar to "What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?", but it is specific to cmpxchg rather than about writes in general.

Answer 1:

I ran some tests. They were very synthetic, though: very little work was done under the lock, and I measured throughput in a heavily contended scenario.

So far, I have observed no consistent difference between lock bts, xchg, and lock cmpxchg.

Other things, however, did have an effect:

An inner load loop is definitely helpful, both with and without pause
One pause in the loop is helpful, both with and without the load loop
The load loop helps more than pause
The best results are achieved by applying the "Improved version" from the Intel® 64 and IA-32 Architectures Optimization Reference Manual (see below)
Starting with a load instead of an RMW/CAS has a mixed effect: it helps in tests without pause, but degrades performance in tests with pause
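To make the terms in the list above concrete, here is a minimal sketch of a spinlock with an inner load loop and one pause per iteration, written with C11 atomics. It is not the code used in the tests. The names spinlock_t, spinlock_lock and cpu_relax are hypothetical, and __builtin_ia32_pause assumes GCC or Clang targeting x86.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumption: GCC/Clang on x86; elsewhere the pause becomes a no-op. */
#if defined(__x86_64__) || defined(__i386__)
#  define cpu_relax() __builtin_ia32_pause()
#else
#  define cpu_relax() ((void)0)
#endif

/* Hypothetical lock type: 0 = free, 1 = busy. */
typedef struct { atomic_int state; } spinlock_t;

static inline void spinlock_init(spinlock_t *l)
{
    atomic_init(&l->state, 0);
}

static inline bool spinlock_try_lock(spinlock_t *l)
{
    /* Unconditional RMW; on x86 this typically compiles to xchg. */
    return atomic_exchange_explicit(&l->state, 1, memory_order_acquire) == 0;
}

static inline void spinlock_lock(spinlock_t *l)
{
    while (!spinlock_try_lock(l)) {
        /* Inner load loop: wait with plain loads until the lock looks
           free, and only then retry the RMW. */
        while (atomic_load_explicit(&l->state, memory_order_relaxed) != 0)
            cpu_relax();   /* one pause per iteration */
    }
}

static inline void spinlock_unlock(spinlock_t *l)
{
    /* Debug check: releasing a lock that is not held is a bug. */
    assert(atomic_load_explicit(&l->state, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

The "starting with a load" variant from the last item would enter the inner load loop before the first spinlock_try_lock call instead of after it.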

The Intel® 64 and IA-32 Architectures Optimization Reference Manual recommends using pause.

Its "Example 2-4. Contended Locks with Increasing Back-off Example" shows a baseline version:

/*******************/
/*Baseline Version */
/*******************/
// atomic {if (lock == free) then change lock state to busy}
while (cmpxchg(lock, free, busy) == fail)
{
    while (lock == busy)
    {
        __asm__ ("pause");
    }
}

and an improved version:

/*******************/
/*Improved Version */
/*******************/
int mask = 1;
int const max = 64; //MAX_BACKOFF
while (cmpxchg(lock, free, busy) == fail)
{
    while (lock == busy)
    {
        for (int i=mask; i; --i){
            __asm__ ("pause");
        }
        mask = mask < max ? mask<<1 : max;
    }
}
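The manual's listing is pseudocode: cmpxchg, free, busy and fail are not defined there. A compilable approximation using C11 atomics might look like the sketch below. The names STATE_FREE, STATE_BUSY, try_acquire, backoff_lock, backoff_unlock and cpu_relax are assumptions made for illustration, as are the memory orderings, and __builtin_ia32_pause assumes GCC or Clang targeting x86.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumption: GCC/Clang on x86; elsewhere the pause becomes a no-op. */
#if defined(__x86_64__) || defined(__i386__)
#  define cpu_relax() __builtin_ia32_pause()
#else
#  define cpu_relax() ((void)0)
#endif

enum { STATE_FREE = 0, STATE_BUSY = 1, MAX_BACKOFF = 64 };

/* One CAS attempt: free -> busy. On x86 this typically compiles to
   lock cmpxchg. Returns true if the lock was acquired. */
static inline bool try_acquire(atomic_int *lock)
{
    int expected = STATE_FREE;
    return atomic_compare_exchange_strong_explicit(
        lock, &expected, STATE_BUSY,
        memory_order_acquire, memory_order_relaxed);
}

static inline void backoff_lock(atomic_int *lock)
{
    int mask = 1;
    while (!try_acquire(lock)) {
        /* Load loop: spin on plain loads while the lock is held. */
        while (atomic_load_explicit(lock, memory_order_relaxed) == STATE_BUSY) {
            /* Issue `mask` pauses, then double the count up to the cap. */
            for (int i = mask; i; --i)
                cpu_relax();
            mask = mask < MAX_BACKOFF ? mask << 1 : MAX_BACKOFF;
        }
    }
}

static inline void backoff_unlock(atomic_int *lock)
{
    /* Debug check: releasing a lock that is not held is a bug. */
    assert(atomic_load_explicit(lock, memory_order_relaxed) == STATE_BUSY);
    atomic_store_explicit(lock, STATE_FREE, memory_order_release);
}
```

As in the manual's listing, mask is not reset after a failed retry, so the back-off keeps growing for the duration of one acquisition.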

Windows SRWLOCK may also be a good example to follow. It uses a load loop and pause. It starts with an interlocked operation: lock bts for exclusive acquire, lock cmpxchg for shared acquire. Even TryAcquireSRWLockExclusive does nothing but lock bts:

RtlTryAcquireSRWLockExclusive:
00007FFA86D71370  lock bts    qword ptr [rcx],0  
00007FFA86D71376  setae       al  
00007FFA86D71379  ret

It doesn't, however, implement an exponentially growing pause in the waiting versions. It does a small number of loads with one pause each, then goes to an OS wait.
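For comparison, the bit-test-and-set try-acquire shown in the disassembly can be expressed portably as a fetch-or on bit 0. This is only a sketch of that pattern, not the SRWLOCK implementation: demo_lock_t and both function names are hypothetical, and the real lock word carries additional state that this ignores. A compiler may turn the fetch-or plus bit test into lock bts on x86, but that is not guaranteed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock word: bit 0 set means "held exclusively". */
typedef struct { _Atomic uint64_t word; } demo_lock_t;

static inline bool demo_try_acquire_exclusive(demo_lock_t *l)
{
    /* Atomically set bit 0 and look at its previous value. The write
       happens whether or not the bit was already set. */
    uint64_t prev = atomic_fetch_or_explicit(&l->word, UINT64_C(1),
                                             memory_order_acquire);
    return (prev & 1) == 0;   /* acquired only if the bit was clear */
}

static inline void demo_release_exclusive(demo_lock_t *l)
{
    /* Debug check: releasing a lock that is not held is a bug. */
    assert(atomic_load_explicit(&l->word, memory_order_relaxed) & 1);
    atomic_fetch_and_explicit(&l->word, ~UINT64_C(1), memory_order_release);
}
```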

Source: https://stackoverflow.com/questions/63008857/does-cmpxchg-write-destination-cache-line-on-failure-if-not-is-it-better-than

Tags

assembly

x86

cpu-cache

micro-optimization

compare-and-swap