I read somewhere (can't find the page anymore) that lock-free data structures are more efficient "for certain workloads", which seems to imply that sometimes they're actually
I would like to add one point to this part of the answer: "Where the mutex or critical section is slow, is when the lock acquisition fails (there is contention). In this case, the OS also invokes the scheduler to suspend the thread until the exclusion object has been released."
It seems that different operating systems can take different approaches to what happens when lock acquisition fails. I use HP-UX, which, for example, has a more sophisticated approach to locking mutexes. Here is its description:
... On the other hand, changing context is an expensive process. If the wait is going to be a short one, we'd rather not do the context switch. To balance out these requirements, when we try to get a semaphore and find it locked, the first thing we do is a short spin wait. The routine psema_spin_1() is called to spin for up to 50,000 clock cycles trying to get the lock. If we fail to get the lock after 50,000 cycles, we then call psema_switch_1() to give up the processor and let another process take over.
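To make the spin-then-switch policy concrete, here is a rough user-space sketch in Python. It is not the HP-UX implementation: `spin_then_block_acquire`, `run_demo` and `SPIN_ATTEMPTS` are names I made up for illustration, and the spin budget is counted in acquisition attempts rather than the 50,000 clock cycles mentioned in the quote. The idea is only to show the two phases: try cheaply a bounded number of times, then give up the processor with a blocking wait.

```python
import threading

# Hypothetical spin budget. The quoted HP-UX text uses a budget of 50,000
# clock cycles; this sketch simply counts non-blocking attempts instead.
SPIN_ATTEMPTS = 1000


def spin_then_block_acquire(lock, spin_attempts=SPIN_ATTEMPTS):
    """Acquire `lock`.

    Returns True if the lock was obtained during the spin phase,
    False if the spin budget ran out and a blocking acquire was used.
    """
    for _ in range(spin_attempts):
        # Non-blocking attempt: returns immediately with True or False.
        if lock.acquire(blocking=False):
            return True
    # Spin budget exhausted: fall back to a blocking acquire, which allows
    # the thread to be suspended until the lock is released.
    lock.acquire()
    return False


def run_demo(num_threads=4, increments=1000):
    """Increment a shared counter from several threads under the lock."""
    lock = threading.Lock()
    counter = 0

    def worker():
        nonlocal counter
        for _ in range(increments):
            spin_then_block_acquire(lock)
            try:
                counter += 1
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Calling `run_demo()` should return `num_threads * increments`, since every increment happens while the lock is held. The return value of `spin_then_block_acquire` tells you which phase succeeded, which could be handy if you want to measure how often the short spin actually avoids the blocking wait for your workload.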