Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?

Submitted by 二次信任 on 2020-08-08 06:19:28

Question


For the purposes of this question, I assume a simple spinlock that never falls back to an OS wait.

I see that a simple spinlock is often implemented with lock xchg or lock bts rather than lock cmpxchg.

But doesn't cmpxchg avoid writing the value if the expected value does not match? If so, aren't failed attempts cheaper with cmpxchg?

Or does cmpxchg write the data and invalidate the cache line in other cores even on failure?
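As a minimal C++ sketch of the two acquire attempts being compared: the function names and the 0 = free / 1 = busy encoding are assumptions made for illustration. On x86, compilers typically lower exchange to xchg and compare_exchange_strong to lock cmpxchg, but that is worth confirming in the disassembly. The sketch only shows the architectural difference in what value ends up stored; it does not settle what happens to the cache line, which is the open question here.

```cpp
#include <atomic>

// Hypothetical lock word encoding: 0 = free, 1 = busy.

// Unconditional swap (typically xchg on x86).
// It stores 1 every time, even when the lock was already busy.
inline bool try_lock_xchg(std::atomic<int>& lock) {
    return lock.exchange(1, std::memory_order_acquire) == 0;
}

// Compare-and-swap (typically lock cmpxchg on x86).
// The value in memory becomes 1 only if it was 0; on failure it is left as it was.
inline bool try_lock_cmpxchg(std::atomic<int>& lock) {
    int expected = 0;
    return lock.compare_exchange_strong(expected, 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed);
}

// Release the lock with a plain store.
inline void unlock(std::atomic<int>& lock) {
    lock.store(0, std::memory_order_release);
}
```

Both functions report success in the same way, so either can be dropped into the same spin loop when benchmarking one against the other.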

This question is similar to "What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?", but it is specific to cmpxchg rather than writes in general.


Answer 1:


I ran some tests. They were very synthetic, though: each thread did very little work under the lock, and I measured throughput in a heavily contended scenario.

So far, I have observed no consistent difference between lock bts, xchg, and lock cmpxchg.

Other factors, however, did have an effect:

  • An inner load loop is definitely helpful, both with and without pause
  • A single pause in the loop is helpful, both with and without the load loop
  • The load loop helps more than pause
  • The best results come from the "Improved Version" in the Intel® 64 and IA-32 Architectures Optimization Reference Manual (see below)
  • Starting with a load instead of the RMW/CAS has a mixed effect: it helps in tests without pause, but degrades performance in tests with pause
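The variants in the list above can be sketched in C++ roughly as follows. This is not the code that was benchmarked: the SpinLock type, its member names, and the cpu_relax helper are hypothetical, and the memory orders are one reasonable choice among several. The sketch shows an inner load loop with pause, once entered after a failed RMW and once entered before the first RMW.

```cpp
#include <atomic>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }  // emits the x86 pause instruction
#else
inline void cpu_relax() {}                // assumption: no-op on other targets
#endif

struct SpinLock {
    std::atomic<bool> busy{false};

    // RMW first: try the atomic exchange immediately, and only after a
    // failure spin on plain loads until the lock looks free.
    void lock_rmw_first() {
        while (busy.exchange(true, std::memory_order_acquire)) {
            while (busy.load(std::memory_order_relaxed)) {
                cpu_relax();
            }
        }
    }

    // Load first: wait until the lock looks free before the first RMW.
    void lock_load_first() {
        for (;;) {
            while (busy.load(std::memory_order_relaxed)) {
                cpu_relax();
            }
            if (!busy.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    void unlock() { busy.store(false, std::memory_order_release); }
};
```

Removing the cpu_relax call gives the "without pause" variants, and removing the inner while loop gives the variants without a load loop.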

The Intel® 64 and IA-32 Architectures Optimization Reference Manual recommends using pause.

Its "Example 2-4. Contended Locks with Increasing Back-off Example" shows the baseline version:

/*******************/
/*Baseline Version */
/*******************/
// atomic {if (lock == free) then change lock state to busy}
while (cmpxchg(lock, free, busy) == fail)
{
    while (lock == busy)
    {
        __asm__ ("pause");
    }
}

and the improved version:

/*******************/
/*Improved Version */
/*******************/
int mask = 1;
int const max = 64; //MAX_BACKOFF
while (cmpxchg(lock, free, busy) == fail)
{
    while (lock == busy)
    {
        for (int i=mask; i; --i){
            __asm__ ("pause");
        }
        mask = mask < max ? mask<<1 : max;
    }
}
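The manual's listing is pseudocode (cmpxchg, free, busy and fail are not defined there). A compilable C++ approximation of the improved version might look like the sketch below. The BackoffSpinLock class, its member names, the cpu_relax helper and the chosen memory orders are assumptions of this sketch, not part of the Intel example.

```cpp
#include <atomic>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }  // emits the x86 pause instruction
#else
inline void cpu_relax() {}                // assumption: no-op on other targets
#endif

class BackoffSpinLock {
    std::atomic<int> state_{0};  // 0 = free, 1 = busy (hypothetical encoding)

public:
    void lock() {
        constexpr int kMaxBackoff = 64;  // MAX_BACKOFF in the manual's listing
        int mask = 1;
        int expected = 0;
        while (!state_.compare_exchange_strong(expected, 1,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed)) {
            // Spin on plain loads; the pause count doubles up to the cap.
            while (state_.load(std::memory_order_relaxed) != 0) {
                for (int i = mask; i; --i) {
                    cpu_relax();
                }
                mask = mask < kMaxBackoff ? mask << 1 : kMaxBackoff;
            }
            // A failed compare_exchange overwrites expected with the observed
            // value, so reset it before the next attempt.
            expected = 0;
        }
    }

    bool try_lock() {
        int expected = 0;
        return state_.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }

    void unlock() { state_.store(0, std::memory_order_release); }
};
```

Because the class provides lock, try_lock and unlock, it can be used with std::lock_guard or std::unique_lock in the usual way.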

Windows SRWLOCK may also be a good example to follow. It uses a load loop and pause. It starts with an interlocked operation: lock bts for an exclusive acquire, lock cmpxchg for a shared acquire. Even TryAcquireSRWLockExclusive does nothing more than a lock bts:

RtlTryAcquireSRWLockExclusive:
00007FFA86D71370  lock bts    qword ptr [rcx],0  
00007FFA86D71376  setae       al  
00007FFA86D71379  ret  

SRWLOCK does not, however, implement an exponentially growing pause in its waiting versions. It does a small number of loads with one pause, then goes to an OS wait.
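The RtlTryAcquireSRWLockExclusive disassembly above sets bit 0 and reports whether it was previously clear. A portable C++ approximation of that single step is sketched below. It is not the SRWLOCK implementation: the function name and the kLockedBit constant are hypothetical. Recent GCC and Clang can compile this fetch_or pattern to lock bts, but that should be treated as an assumption to check in the generated code.

```cpp
#include <atomic>
#include <cstdint>

// Atomically set bit 0 of the lock word and return true if it was clear
// before, i.e. if the caller now owns the lock. Other bits are left untouched.
inline bool try_acquire_exclusive(std::atomic<std::uint64_t>& word) {
    constexpr std::uint64_t kLockedBit = 1;  // hypothetical name for bit 0
    std::uint64_t previous = word.fetch_or(kLockedBit, std::memory_order_acquire);
    return (previous & kLockedBit) == 0;
}
```

Like xchg, and unlike a failed cmpxchg, this operation leaves the bit set whether or not the acquire succeeded.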



Source: https://stackoverflow.com/questions/63008857/does-cmpxchg-write-destination-cache-line-on-failure-if-not-is-it-better-than
