I'm new to using gcc inline assembly, and was wondering if, on an x86 multi-core machine, a spinlock (without race conditions) could be implemented in plain AT&T-syntax assembly.
The syntax is wrong, but it works after a little modification:
spin_lock:
    movl $0, %eax                  # expected old value: 0 (unlocked)
    movl $1, %ecx                  # new value: 1 (locked)
    lock cmpxchg %ecx, (lock_addr) # if *lock_addr == %eax, store %ecx and set ZF
    jnz spin_lock                  # ZF clear: someone else holds the lock, retry
    ret
spin_unlock:
    movl $0, (lock_addr)           # plain aligned store releases the lock
    ret
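This assumes lock_addr names a 32-bit lock variable defined elsewhere, for example (the initializer and the 0/1 encoding are my assumption, not spelled out above):

.data
lock_addr:
    .long 0                        # 0 = unlocked, 1 = locked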
To make the code faster, assume lock_addr is stored in the %rdi register. Spin using movl and test instead of lock cmpxchgl %ecx, (%rdi), and only use lock cmpxchgl %ecx, (%rdi) to try to enter the critical section when there's a chance it will succeed. That way unneeded bus locking can be avoided.
spin_lock:
    movl $1, %ecx                  # new value: 1 (locked)
loop:
    movl (%rdi), %eax              # read-only spin: no bus locking
    test %eax, %eax
    jnz loop                       # lock still held, keep reading
    lock cmpxchgl %ecx, (%rdi)     # looks free: try to take it (%eax == 0 = expected)
    jnz loop                       # lost the race, go back to spinning
    ret
spin_unlock:
    movl $0, (%rdi)                # plain aligned store releases the lock
    ret
I have tested it using pthreads and a simple loop like this:
for(i = 0; i < 10000000; ++i){
    spin_lock(&mutex);
    ++count;
    spin_unlock(&mutex);
}
In my test, the first version takes 2.5~3 seconds and the second one takes 1.3~1.8 seconds.
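For reference, a complete harness along those lines might look like the following sketch. The thread count, the worker function, and the assumption that the asm routines above are assembled separately (with .globl spin_lock / .globl spin_unlock) are mine, not details from the original test; the %rdi-based version is directly callable from C because the first argument arrives in %rdi under the x86-64 SysV ABI.

#include <pthread.h>
#include <stdio.h>

extern void spin_lock(int *lock);    // the asm routines above, assembled separately
extern void spin_unlock(int *lock);

static int mutex = 0;
static long count = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000000; ++i) {
        spin_lock(&mutex);
        ++count;
        spin_unlock(&mutex);
    }
    return NULL;
}

int main(void)
{
    enum { NUM_THREADS = 4 };        // arbitrary choice for illustration
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(t[i], NULL);
    printf("count = %ld\n", count);  // expect NUM_THREADS * 10000000
    return 0;
}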
You have the right idea, but your asm is broken: cmpxchg can't work with an immediate operand, only registers. lock is not a valid prefix for mov; a mov to an aligned address is atomic on x86, so you don't need lock anyway.
It has been some time since I've used AT&T syntax, so I hope I remembered everything:
spin_lock:
    xorl %ecx, %ecx
    incl %ecx                        # newVal = 1
spin_lock_retry:
    xorl %eax, %eax                  # expected = 0
    lock; cmpxchgl %ecx, (lock_addr)
    jnz spin_lock_retry
    ret

spin_unlock:
    movl $0, (lock_addr)             # atomic release-store
    ret
Note that GCC has atomic builtins, so you don't actually need to use inline asm to accomplish this:
void spin_lock(int *p)
{
    while(!__sync_bool_compare_and_swap(p, 0, 1));
}

void spin_unlock(int volatile *p)
{
    asm volatile ("":::"memory"); // acts as a memory barrier
    *p = 0;
}
As Bo says below, locked instructions incur a cost: each one must acquire exclusive access to the cache line and hold it locked for the duration of the lock cmpxchg, just as a normal store to that line would, but held for longer. This can delay the unlocking thread, especially if multiple threads are waiting to take the lock. Even without many CPUs, it's easy and worth it to optimize around:
#include <immintrin.h> // for _mm_pause

void spin_lock(int volatile *p)
{
    while(!__sync_bool_compare_and_swap(p, 0, 1))
    {
        // spin read-only until a cmpxchg might succeed
        while(*p) _mm_pause(); // or maybe do{}while(*p) to pause first
    }
}
The pause instruction is vital for performance on HyperThreading CPUs when you've got code that spins like this -- it lets the second thread execute while the first thread is spinning. On CPUs which don't support pause, it is treated as a nop. pause also prevents memory-order mis-speculation when leaving the spin loop, when it's finally time to do real work again. See What is the purpose of the "PAUSE" instruction in x86?
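Folding pause into the earlier %rdi-based loop might look like this sketch (still assuming the lock address arrives in %rdi; the .L local label names are mine):

spin_lock:
    movl $1, %ecx                  # value to store when acquiring
.Lretry:
    xorl %eax, %eax                # expected = 0 (unlocked)
    lock cmpxchgl %ecx, (%rdi)     # attempt to take the lock
    je .Ldone                      # ZF set: we own it now
.Lwait:
    pause                          # spin-wait hint; a nop on older CPUs
    cmpl $0, (%rdi)                # read-only check, no bus locking
    jnz .Lwait                     # still held, keep waiting
    jmp .Lretry                    # looks free, try the cmpxchg again
.Ldone:
    ret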
Note that spin locks are actually rarely used: typically, one uses something like a critical section or futex. These integrate a spin lock for performance under low contention, but then fall back to an OS-assisted sleep and notify mechanism. They may also take measures to improve fairness, and do lots of other things the cmpxchg / pause loop doesn't.
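To make the spin-then-sleep idea concrete, here is a rough Linux-only sketch in the spirit of Drepper's "Futexes Are Tricky". The spin count, the state encoding (0 = unlocked, 1 = locked, 2 = locked with possible waiters), and the function names are all assumptions for illustration, not a production lock:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <immintrin.h>

void futex_lock(int *p)
{
    for (int i = 0; i < 100; ++i) {              // bounded spin phase; count is arbitrary
        if (__sync_bool_compare_and_swap(p, 0, 1))
            return;
        _mm_pause();
    }
    // Slow path: swap in 2 ("maybe waiters") and sleep until the holder wakes us.
    while (__sync_lock_test_and_set(p, 2) != 0)  // xchg; a full barrier on x86
        syscall(SYS_futex, p, FUTEX_WAIT, 2, NULL, NULL, 0);
}

void futex_unlock(int *p)
{
    if (__sync_lock_test_and_set(p, 0) == 2)     // did anyone possibly go to sleep?
        syscall(SYS_futex, p, FUTEX_WAKE, 1, NULL, NULL, 0);
}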
Also note that cmpxchg is unnecessary for a simple spinlock: you can use xchg and then check whether the old value was 0 or not. Doing less work inside the locked instruction may keep the cache line pinned for less time. See Locks around memory manipulation via inline assembly for a complete asm implementation using xchg and pause (but still with no fallback to OS-assisted sleep, just spinning indefinitely).
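A minimal sketch of that xchg-based approach, using the GCC builtins instead of inline asm (on x86, __sync_lock_test_and_set compiles to xchg and __sync_lock_release to a plain store of 0):

#include <immintrin.h> // for _mm_pause

void spin_lock(int volatile *p)
{
    while (__sync_lock_test_and_set(p, 1)) // atomically swap in 1; nonzero = was taken
        while (*p) _mm_pause();            // wait read-only until it looks free
}

void spin_unlock(int volatile *p)
{
    __sync_lock_release(p);                // store 0 with release semantics
}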
This will put less contention on the memory bus:
void spin_lock(int *p)
{
    while(!__sync_bool_compare_and_swap(p, 0, 1)) while(*p);
}