Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?


The "xchg reg,[mem]" instruction will signal its lock intention over the LOCK pin of the core. This signal weaves its way past other cores and caches down to the bus-mastering buses (PCI variants etc) which will finish what they are doing and eventually the LOCKA (acknowledge) pin will signal the CPU that the xchg may complete. Then the LOCK signal is shut off. This sequence can take a long time (hundreds of CPU cycles or more) to complete. Afterwards the appropriate cache lines of the other cores will have been invalidated and you will have a known state, i e one that has ben synchronized between the cores.

The xchg instruction is all that is necessary to implement an atomic lock. If the lock itself is successful you have access to the resource that you have defined the lock to control access to. Such a resource could be a memory area, a file, a device, a function or what have you. Still, it is always up to the programmer to write code that uses this resource when it's been locked and doesn't when it hasn't. Typically the code sequence following a successful lock should be made as short as possible, so that other code is hindered as little as possible from acquiring access to the resource.

Keep in mind that if the lock wasn't successful you need to try again by issuing a new xchg.
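As an illustration (this sketch is mine, not part of the original answer), a minimal test-and-set spin lock in C11 looks like the following; on x86, atomic_flag_test_and_set typically compiles down to a LOCKed instruction such as xchg, and the while loop is exactly the retry described above:

    #include <stdatomic.h>

    static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

    void spin_lock(void)
    {
        // Returns the previous value: true means another thread holds
        // the lock, so spin and issue the atomic exchange again.
        while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
            ;  // busy-wait
    }

    void spin_unlock(void)
    {
        atomic_flag_clear_explicit(&lock_flag, memory_order_release);
    }

Keeping the code between spin_lock and spin_unlock short follows the advice above: the less time the lock is held, the less time other cores spend spinning.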

"Lock free" is an appealing concept but it requires the elimination of shared resources. If your application has two or more cores simultaneously reading from and writing to a common memory address "lock free" is not an option.

I may well not have properly understood the question, but...

If you're spinning, one problem is the compiler optimizing your spin away: since it cannot see another thread changing the flag, it may hoist the load out of the loop. Declaring the flag volatile solves this.
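As a sketch of that failure mode (ready is a hypothetical flag of mine, not from the question):

    // Flag written by another thread. Without volatile, the compiler
    // may load it once, cache it in a register, and spin forever.
    volatile int ready = 0;

    void wait_for_ready(void)
    {
        while (ready == 0)
            ;  // volatile forces a fresh load of ready on each iteration
    }

Note that volatile only constrains the compiler; it does not order memory accesses at the hardware level, which is what the barriers discussed below are for.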

The memory barrier, if you have one, is issued by the writer to the spin lock, not the reader. The writer doesn't strictly have to use one: issuing it ensures the write is pushed out immediately, but the write will reach memory fairly soon regardless.

The barrier prevents a thread executing that code from reordering memory accesses across its location, which is its other cost.

Keep in mind that barriers are typically used to order sets of memory accesses, so your code could very likely also need barriers in other places. For example, it wouldn't be uncommon for the barrier requirement to look like this instead:

while ( 1 ) {

    v = pShared->value;         // read the flag first
    __acquire_barrier() ;       // later loads may not start before this

    if ( v != 0 ) {
        foo( pShared->something ) ;  // ordered after the load of value
    }
}

This barrier would prevent loads and stores in the if block (i.e. pShared->something) from executing before the load of value is complete. A typical example is a "producer" that uses a store of a non-zero value to flag that some other memory (pShared->something) is in some other expected state, as in:

pShared->something = 1 ;  // was 0: publish the payload first
__release_barrier() ;     // earlier stores may not be delayed past this
pShared->value = 1 ;      // was 0: now raise the flag

In this typical producer-consumer scenario, you'll almost always need paired barriers: one for the producer's store that flags that the auxiliary memory is visible (so that the effects of the value store aren't seen before the something store), and one for the consumer (so that the something load isn't started before the value load is complete).

Those barriers are also platform specific. For example, on PowerPC (using the xlC compiler), you'd use __isync() and __lwsync() for the consumer and producer respectively. What barriers are required may also depend on the mechanism that you use for the store and load of value. If you've used an atomic intrinsic that results in an Intel LOCKed instruction (perhaps implicitly), then this will introduce an implicit barrier, so you may not need anything. Additionally, you'll likely also need to make judicious use of volatile (or, preferably, use an atomic implementation that does so under the covers) in order to get the compiler to do what you want.
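For completeness, the paired-barrier pattern above can be written portably with C11 acquire/release atomics; this is a sketch of mine (the struct layout and foo() are assumptions), not something from the original answer:

    #include <stdatomic.h>

    void foo(int);  // hypothetical consumer of the payload

    struct shared {
        int something;     // plain payload published by the producer
        atomic_int value;  // flag: nonzero means 'something' is ready
    };

    void producer(struct shared *pShared)
    {
        pShared->something = 1;  // was 0
        // Release: the 'something' store cannot be reordered past this.
        atomic_store_explicit(&pShared->value, 1, memory_order_release);
    }

    void consumer(struct shared *pShared)
    {
        for (;;) {
            // Acquire: the 'something' load below cannot start earlier.
            if (atomic_load_explicit(&pShared->value, memory_order_acquire) != 0)
                foo(pShared->something);  // guaranteed to see 1, not 0
        }
    }

On x86 the release store and acquire load compile to plain mov instructions, while on PowerPC they map to the lwsync/isync-style fences mentioned above.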
