Where can I find documentation for "adaptive" pthread mutexes? The symbol PTHREAD_MUTEX_ADAPTIVE_NP is defined on my system, but the only documentation I can find online say
PTHREAD_MUTEX_ADAPTIVE_NP is something that I invented while working as a glibc contributor on making LinuxThreads more reliable and better performing. LinuxThreads was the predecessor to glibc's NPTL library, originally developed as a stand-alone library by Xavier Leroy, who is also well-known as one of the creators of OCaml.
The adaptive mutex survived into NPTL in essentially unmodified form: the code is nearly identical, including the magic constants for the estimator smoothing and the maximum spin relative to the estimator.
Under SMP, when you go to acquire a mutex and see that it is locked, it can be sub-optimal to simply give up and call into the kernel to block. If the owner of the lock only holds the lock for a few instructions, it is cheaper to just wait for the execution of those instructions, and then acquire the lock with an atomic operation, instead of spending hundreds of extra cycles by making a system call.
The kernel developers know this very well, which is one reason why we have spinlocks in the Linux kernel for fast critical sections. (Among the other reasons is, of course, that code which cannot sleep, because it is in an interrupt context, can acquire spinlocks.)
The question is, how long should you wait? If you spin forever until the lock is acquired, that can be sub-optimal. User space programs are not well-written like kernel code (cough). They could have long critical sections. They also cannot disable pre-emption; sometimes critical sections blow up due to a context switch. (POSIX threads now provide real-time tools to deal with this: you can give threads a real-time priority with FIFO scheduling and such, plus configure processor affinity.)
I think we experimented with fixed iteration counts, but then I had this idea: why should we guess, when we can measure? Why don't we implement a smoothed estimator of the lock duration, similar to what we do for the TCP retransmission time-out (RTO) estimator? Each time we spin on a lock, we should measure how many spins it actually took to acquire it. Moreover, we should not spin forever: we should perhaps spin only at most twice the current estimator value. When we take a measurement, we can smooth it exponentially, in just a few instructions: take a fraction of the previous value and a fraction of the new value, and add them together, which is the same as adding a fraction of their difference back to the estimator: say,
estimator += (new_val - estimator)/8
for a blend of 7/8 of the old value and 1/8 of the new value.
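To make the arithmetic concrete, here is a small sketch of that smoothing step in C. The function names `smooth` and `smooth_all` are made up for illustration, and plain truncating integer division is assumed; this is not taken from the glibc source.

```c
#include <stddef.h>

/* One exponential smoothing step: move the estimator 1/8 of the way
 * toward the new sample. Integer division truncates toward zero, so
 * differences smaller than 8 in magnitude leave the estimator unchanged. */
int smooth(int estimator, int sample)
{
    return estimator + (sample - estimator) / 8;
}

/* Feed a series of samples through the estimator and return the result.
 * (Hypothetical helper, only here to show how the value drifts.) */
int smooth_all(int estimator, const int *samples, size_t n)
{
    for (size_t i = 0; i < n; i++)
        estimator = smooth(estimator, samples[i]);
    return estimator;
}
```

For example, with an estimator of 80, a single sample of 160 moves it to 90, while a sample of 0 moves it to 70: one outlier shifts the estimate only slightly.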
You can think of this as a watchdog. Suppose that the estimator tells you that the lock, on average, takes 80 spins to acquire. You can be quite confident, then, that if you have executed 160 spins, then something is wrong: the owner of the lock is executing some exceptionally long case, or maybe has hit a page fault or was otherwise preempted. At this point the waiting thread cuts its losses and calls into the kernel to block.
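The watchdog idea can be sketched on top of an ordinary pthread mutex, roughly as follows. This is an illustration of the technique under stated assumptions, not the actual library code: the type `adaptive_lock`, the function names, and the constants `SPIN_FLOOR` and `SPIN_CAP` are all hypothetical. The small floor is added to the limit so that an estimator starting at zero can still grow, which means an estimator of 80 gives a limit of 170 here rather than exactly 160.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define SPIN_FLOOR 10    /* assumed: lets a zero estimator grow */
#define SPIN_CAP   1000  /* assumed: hard upper bound on spinning */

typedef struct {
    pthread_mutex_t lock;
    atomic_int estimator;  /* smoothed number of spins needed to acquire */
} adaptive_lock;

int adaptive_lock_init(adaptive_lock *a)
{
    atomic_init(&a->estimator, 0);
    return pthread_mutex_init(&a->lock, NULL);
}

/* Spin at most about twice the current estimate, within the cap. */
int spin_limit(int estimator)
{
    int limit = 2 * estimator + SPIN_FLOOR;
    return limit > SPIN_CAP ? SPIN_CAP : limit;
}

int adaptive_lock_acquire(adaptive_lock *a)
{
    /* Fast path: no contention, no measurement taken. */
    if (pthread_mutex_trylock(&a->lock) == 0)
        return 0;

    int est = atomic_load_explicit(&a->estimator, memory_order_relaxed);
    int limit = spin_limit(est);
    int spins = 0;

    do {
        if (spins++ >= limit) {
            /* Watchdog fired: cut our losses and block in the kernel. */
            int rc = pthread_mutex_lock(&a->lock);
            if (rc != 0)
                return rc;
            break;
        }
    } while (pthread_mutex_trylock(&a->lock) != 0);

    /* We hold the lock now, so only one thread updates at a time. */
    est += (spins - est) / 8;
    atomic_store_explicit(&a->estimator, est, memory_order_relaxed);
    return 0;
}

int adaptive_lock_release(adaptive_lock *a)
{
    return pthread_mutex_unlock(&a->lock);
}
```

A real implementation would spin on the lock word with atomic operations and a CPU pause hint instead of calling `pthread_mutex_trylock` in a loop; the wrapper above only shows the control flow of measure, limit, and fall back.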
Without measurement, you cannot do this accurately: there is no "one size fits all" value. Say, a fixed limit of 200 spins would be sub-optimal in a program whose critical sections are so short that a lock can almost always be fetched after waiting only 10 spins. The mutex locking function would burn through 200 iterations every time there is an anomalous wait time, instead of nicely giving up at, say, 20 and saving cycles.
This adaptive approach is specialized, in the sense that it will not work for all locks in all programs, so it is packaged as a special mutex type. For instance, it will not work very well for programs that lock mutexes for long periods: periods so long that more CPU time is wasted spinning on the large estimator values than would have been spent by going into the kernel. The approach is also not suitable for uniprocessors: all threads besides the one which is trying to get the lock are suspended in the kernel, so the owner cannot release the lock while the waiter spins. The approach is also not suitable in situations in which fairness is important: it is an opportunistic lock. No matter how many other threads have been waiting, for no matter how long, or what their priority is, a new thread can come along and snatch the lock.
If you have very well-behaved code with short critical sections that are highly contended, and you're looking for better performance on SMP, the adaptive mutex may be worth a try.
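For anyone who wants to try it, here is a minimal sketch of requesting the adaptive type through the standard attribute calls. It assumes glibc on Linux (other C libraries may not define PTHREAD_MUTEX_ADAPTIVE_NP at all), and the helper name `init_adaptive_mutex` is made up for this example.

```c
#define _GNU_SOURCE  /* conventional when using glibc extensions */
#include <pthread.h>

/* Initialize *m as an adaptive mutex.
 * Returns 0 on success or a pthread error code on failure. */
int init_adaptive_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    /* The attribute object is no longer needed once the mutex exists. */
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

After that, the mutex is used with the ordinary pthread_mutex_lock and pthread_mutex_unlock calls; only the behavior under contention differs. Whether it helps is something to measure in your own program.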
The symbol is mentioned here:
http://elias.rhi.hi.is/libc/Mutexes.html
"LinuxThreads supports only one mutex attribute: the mutex type, which is either PTHREAD_MUTEX_ADAPTIVE_NP for "fast" mutexes, PTHREAD_MUTEX_RECURSIVE_NP for "recursive" mutexes, PTHREAD_MUTEX_TIMED_NP for "timed" mutexes, or PTHREAD_MUTEX_ERRORCHECK_NP for "error checking" mutexes. As the NP suffix indicates, this is a non-portable extension to the POSIX standard and should not be employed in portable programs.
The mutex type determines what happens if a thread attempts to lock a mutex it already owns with pthread_mutex_lock. If the mutex is of the "fast" type, pthread_mutex_lock simply suspends the calling thread forever. If the mutex is of the "error checking" type, pthread_mutex_lock returns immediately with the error code EDEADLK. If the mutex is of the "recursive" type, the call to pthread_mutex_lock returns immediately with a success return code. The number of times the thread owning the mutex has locked it is recorded in the mutex. The owning thread must call pthread_mutex_unlock the same number of times before the mutex returns to the unlocked state.
The default mutex type is "timed", that is, PTHREAD_MUTEX_TIMED_NP."
EDIT: updated with info found by jthill (thanks!)
A little more info on the mutex flags and the PTHREAD_MUTEX_ADAPTIVE_NP can be found here:
"The PTHRED_MUTEX_ADAPTIVE_NP is a new mutex that is intended for high throughput at the sacrifice of fairness and even CPU cycles. This mutex does not transfer ownership to a waiting thread, but rather allows for competition. Also, over an SMP kernel, the lock operation uses spinning to retry the lock to avoid the cost of immediate descheduling."
Which basically suggests the following: where high throughput is desirable, such a mutex can be used, but it requires extra consideration in the thread logic because of its very nature. You will have to design an algorithm that can exploit these properties to achieve high throughput: something that load-balances itself from within (as opposed to "from the kernel") and where order of execution is unimportant.
There was a very good book on Linux/Unix multithreaded programming whose name escapes me. If I find it I'll update.
Here you go. As I read it, it's a brutally simple mutex that doesn't care about anything except making the no-contention case run fast.