Suppose I have an application that may or may not have spawned multiple threads. Is it worth it to conditionally protect operations that need synchronization with a std::mutex, skipping the lock when the app is known to be single-threaded?
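The pattern being asked about is something like the following (my paraphrase, since the question's original snippet isn't reproduced here; the names are mine):

#include <atomic>
#include <mutex>

std::atomic<bool> multi_threaded{false};  // set to true before spawning any threads
std::mutex mtx;
void stuff();                             // the work that needs synchronization

void do_work() {
    if (multi_threaded) {
        std::lock_guard<std::mutex> lock(mtx);
        stuff();
    } else {
        stuff();  // single-threaded: skip the lock entirely
    }
}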
Uncontended locks are not too bad on modern systems, not needing to enter the kernel. But they still involve a full memory barrier and (or as part of) an atomic RMW operation. They're slower than a perfectly-predicted compare/branch.
And being a function call, they defeat some optimizations, e.g. forcing the compiler to spill variables from registers back to memory, including the pointer members of a std::vector control block, introducing extra store/reload latency. (And actually the full memory barrier would defeat store-forwarding.)
(Being non-inlinable is how mutex functions actually prevent compile-time reordering on most implementations, as well as doing whatever is needed in asm to atomically take the lock and prevent runtime reordering. The runtime-reordering part involves draining the store buffer.)
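To illustrate the register-spilling point, consider a loop like this sketch (a minimal example I made up, not from the question): the opaque lock/unlock calls force the compiler to assume the vector's internal pointers may have changed, so they live in memory instead of registers.

#include <mutex>
#include <vector>

std::mutex mtx;

void fill_locked(std::vector<int> &v, int n) {
    for (int i = 0; i < n; ++i) {
        mtx.lock();        // opaque call: v's pointer members must be up to date in memory
        v.push_back(i);    // ...and get reloaded here instead of staying in registers
        mtx.unlock();      // another opaque call, plus the atomic RMW / full barrier
    }
}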
Depending on how much work you do and how fine-grained your locking is, the cost of an uncontended mutex can be pretty small. But if you're doing it around every vector::push_back() in a loop, you might see a speedup factor on the order of about 20 for that loop.

(Based on assumptions of one store per 2 or 3 clock cycles on average, which is reasonable assuming some memory-level parallelism and/or cache hits. A push_back loop could even be auto-vectorized and average better than 1 element per clock cycle, assuming small elements and cheap computation of values. lock cmpxchg on Skylake has 1 per 18 cycle throughput with no other memory operations in between; https://agner.org/optimize/. Other microarchitectures, including for non-x86 ISAs, will be different, but about an order of magnitude is probably a good ballpark estimate.)
It might still be a negligible part of your total program run-time, though. And the check will slightly hurt the multi-threaded case by doing extra loads, and by adding another global variable that has to stay hot in cache for good performance. And that global var might be in a different cache line from anything else.
If you had a bad thread/mutex library where even the uncontended case entered the kernel, you could be looking at a factor of maybe 400 speedup, or tens of thousands on a modern x86 kernel that uses microcode-assisted Spectre mitigation by flushing the branch-predictors; that takes thousands of cycles every time you enter the kernel. I'd hope there aren't any systems with a kernel modern enough to do that but still using heavy-weight locks.
I think the mainstream OSes (Linux / Mac / Windows) all have lightweight locking that only enters the kernel as a fallback on contention. See Jeff Preshing's Always Use a Lightweight Mutex article. Probably also Solaris and *BSD.
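For flavor, a lightweight lock's uncontended fast path is just an atomic RMW in user-space; only contention needs the kernel (e.g. futex on Linux). A hand-wavy sketch, not any real implementation (real ones record waiters and futex-wait/wake; here we just yield):

#include <atomic>
#include <thread>

struct LightMutex {
    std::atomic<int> state{0};   // 0 = unlocked, 1 = locked

    void lock() {
        int expected = 0;
        // Fast path: one atomic CAS, no kernel entry when uncontended.
        while (!state.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
            expected = 0;
            std::this_thread::yield();   // a real impl would futex-wait here
        }
    }
    void unlock() {
        state.store(0, std::memory_order_release);
        // a real impl would futex-wake a waiter here, if one was recorded
    }
};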
(Cost to enter the kernel at all with syscall on Skylake x86: ~100 to 150 cycles or so, IIRC. With Spectre/Meltdown mitigations on x86, you also change page tables on entry and exit (expensive, and potentially leading to TLB misses / page walks) and maybe use a special asm instruction to flush branch prediction.)

A system call is also essentially serializing; in a tight user-space loop, it doesn't leave much for out-of-order exec to look at. And there's at least some work within the kernel. It also destroys any memory-level parallelism you could have had across loop iterations, but a full barrier from a mutex lock already does that.
So if for some reason you care about bad implementations with very expensive locks even in the uncontended case, you very likely do want this conditional check. (And you probably want the multi-threaded case to be less fine-grained.) But such implementations are hopefully not widespread: GNU/Linux is definitely not like this, and AFAIK neither is anything else important.
gcc's libstdc++ already sort of does this optimization, checking __gthread_active_p () inside mutex lock/unlock (e.g. __gthread_mutex_lock in /usr/include/c++/9.1.0/x86_64-pc-linux-gnu/bits/gthr-default.h), doing nothing if it returns false. And this is in a header, so that the wrapper around pthread_mutex_lock can inline into your code.
On GNU/Linux (glibc) it works by checking if you built with g++ -pthread or not (checking whether the (dynamic) linker gave us a non-zero address for a libpthread private function symbol, using weak-alias stuff). Since this condition is a link-time constant, it doesn't even need to be atomic<>, so the compiler can keep the result in a register; it's basically just a load of a non-atomic void*. libstdc++ on other OSes (not glibc) has other strategies for checking; see the other definitions.
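The mechanism boils down to something like this sketch (paraphrasing the gthr-posix.h idea rather than quoting it; the exact symbol glibc tests for, and the __gthrw_ weak-alias plumbing, vary by version):

// Weak declaration: the symbol's address is null unless libpthread got linked in.
extern "C" int __pthread_key_create(unsigned *, void (*)(void *)) __attribute__((weak));

static inline bool gthread_active_p() {
    // Resolved at (dynamic) link time, so it behaves like a constant:
    // non-null iff the program was built/linked with -pthread.
    return &__pthread_key_create != nullptr;
}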
Mehrdad's test-case runs fast even for the Unconditional case when built without -pthread: ~727ms for the 1000M iterations on Arch GNU/Linux, g++9.1 -O3, glibc 2.29-4, i7-6700k (Skylake) at ~4.2GHz (turbo) with echo performance > energy_performance_preference. That's almost exactly 3 clock cycles per iteration, bottlenecked on the 3-cycle loop-carried dependency chain through total¹. (I bumped up the iteration count from Mehrdad's original instead of using higher-precision timing / printing, partly to hide startup overhead and max-turbo ramp-up.)
But with g++ -O3 -pthread, so glibc's pthread_mutex_lock and unlock do get called, it's about 18 times slower on Skylake: about 13000ms on my machine, which is about 54 clock cycles / iteration.
The test-case doesn't do any memory access inside the critical section, just total = ((total << 1) ^ i) + ((total >> 1) & i) on a local unsigned int total which the compiler can keep in a register across the mutex function calls. So the only stores that the lock cmpxchg (lock) and lock dec (unlock) have to drain from the store buffer are the plain stores to other mutex fields, and the return address pushed on the stack by x86's call instruction. This should be somewhat similar to a loop doing .push_back(i) on a std::vector. Per Agner Fog's testing, those locked instructions alone, with no other memory access, would account for 36 cycles of throughput cost. The actual 54 cycles/iter shows that other work in the lock/unlock functions, and waiting for other stores to flush, has a cost. (Out-of-order exec can overlap the actual total = ... calculation with all this; we know that locked instructions don't block out-of-order exec of independent ALU instructions on Skylake. Although mfence does, because of a microcode update to fix an erratum, making gcc's mov+mfence strategy for seq-cst stores, instead of xchg like other compilers, even worse.)
Footnote 1: At -O3, GCC hoists the if(__gthread_active_p ()) out of the loop, making two versions of the loop. (This is measurably faster than having 3 taken branches inside the loop, including the loop branch itself.)

The "Conditional" version includes a useless load of single_threaded into a register that gets overwritten right away, because nothing happens based on the test. (Compilers don't optimize atomics at all, like volatile, so even an unused load stays. But fortunately x86-64 doesn't need any extra barrier instructions for seq_cst loads, so it barely costs anything.) Still, over 10 back-to-back runs: Conditional: 728ms pretty consistently. Unconditional: 727ms pretty consistently. vs. a calculated 716ms for 3 cycles/iter at a measured average of 4.19GHz user-space cycles/sec under perf stat -r10 ./a.out.
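In source terms, the -O3 transformation is roughly this (my paraphrase of what the asm does, not GCC's actual intermediate representation; lock(), unlock(), and gthread_active are stand-ins):

extern void lock(), unlock();    // stand-ins for the pthread_mutex_lock/unlock calls
extern bool gthread_active;      // stand-in for the __gthread_active_p() result

// What the source does: a loop-invariant branch tested every iteration.
unsigned with_branch_in_loop(unsigned N) {
    unsigned total = 0;
    for (unsigned i = 0; i < N; ++i) {
        if (gthread_active) lock();
        total = ((total << 1) ^ i) + ((total >> 1) & i);
        if (gthread_active) unlock();
    }
    return total;
}

// What -O3 effectively emits: the condition hoisted, two copies of the loop.
unsigned unswitched(unsigned N) {
    unsigned total = 0;
    if (gthread_active) {
        for (unsigned i = 0; i < N; ++i) {
            lock();
            total = ((total << 1) ^ i) + ((total >> 1) & i);
            unlock();
        }
    } else {                     // single-threaded copy: no calls at all
        for (unsigned i = 0; i < N; ++i)
            total = ((total << 1) ^ i) + ((total >> 1) & i);
    }
    return total;
}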
But at -O2, the branches on __gthread_active_p stay inside the loop. If you compile with gcc -O2, or even at -O3 if the compiler decides not to do loop-multiversioning or inversion or whatever it's called when an if is hoisted, you'll get asm like this:
# g++ 9.1 -O2 for x86-64 on Arch GNU/Linux
# early in the function, before any loops: load a symbol address into a register
10de: 48 8b 2d f3 2e 00 00 mov rbp,QWORD PTR [rip+0x2ef3] # 3fd8 <__pthread_key_create@GLIBC_2.2.5>
...
# "Unconditional" inner loop
11b8: 48 85 ed test rbp,rbp # do{
11bb: 74 10 je 11cd # if( __gthread_active_p () )
11bd: 4c 89 ef mov rdi,r13 # pass a pointer to the mutex in RDI
11c0: e8 bb fe ff ff call 1080 # pthread_mutex_lock@plt
11c5: 85 c0 test eax,eax
11c7: 0f 85 f1 00 00 00 jne 12be # if non-zero retval: jump to a call std::__throw_system_error( eax ) block
11cd: 43 8d 04 24 lea eax,[r12+r12*1] # total<<1 = total+total
11d1: 41 d1 ec shr r12d,1 # shifts in parallel
11d4: 31 d8 xor eax,ebx
11d6: 41 21 dc and r12d,ebx # xor, and with i
11d9: 41 01 c4 add r12d,eax # add the results: 3 cycle latency from r12 -> r12 assuming perfect scheduling
11dc: 48 85 ed test rbp,rbp
11df: 74 08 je 11e9 # conditional skip mov/call
11e1: 4c 89 ef mov rdi,r13
11e4: e8 77 fe ff ff call 1060 # pthread_mutex_unlock@plt
11e9: 83 c3 01 add ebx,0x1
11ec: 81 fb 80 96 98 00 cmp ebx,0x989680
11f2: 75 c4 jne 11b8 # }while(i<10000000)
I can't repro this code-gen on Godbolt with g++, or clang with libc++. https://godbolt.org/z/kWQ9Rn Godbolt's install of libstdc++ maybe doesn't have the same macro defs as a proper install? call __gthrw_pthread_mutex_lock(pthread_mutex_t*) isn't inlining, so we can't see the effect of the if (!__gthread_active_p ()) check.
If you're the only thread running, that won't change unless your loop starts threads.
You can make the variable non-atomic. Set it right before you start any threads, then never write it again. All threads can then just read it into a register across loop iterations. And compilers can even hoist the check out of loops for you. (Like gcc -O3 does for the branch inside the GCC mutex implementation as described above, but not at -O2.)
Or you can manually hoist it out of a loop, instead of letting compilers branch on a loop-invariant register value after hoisting the load of a non-atomic variable. If manually hoisting helps your compiler make a loop significantly faster, might as well go all-in on this optimization:
// global scope
bool multi_threaded = false;   // zero init lets this go in the BSS

// in a function
if (!multi_threaded) {
    // optionally take a lock here, outside an inner loop
    std::lock_guard lock(mutex);
    for (int i = 0; i < n; ++i) {
        stuff;
    }
} else {
    for (int i = 0; i < n; ++i) {
        std::lock_guard lock(mutex);
        stuff;
    }
}
Pull the loop body out into a function to avoid duplication if it's more than trivial.
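For instance, with a lambda (one way to factor it; reusing the names from the snippet above, with stuff(i) standing in for the real body):

auto body = [&](int i) { stuff(i); };   // loop body written once

if (!multi_threaded) {
    std::lock_guard lock(mutex);        // optional coarse lock, outside the loop
    for (int i = 0; i < n; ++i) body(i);
} else {
    for (int i = 0; i < n; ++i) {
        std::lock_guard lock(mutex);    // fine-grained lock, inside the loop
        body(i);
    }
}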
// starting threads
multi_threaded = true;
std::thread t(stuff);
If you want to ever return to single-threaded mode, you can do that safely at some point when you know you're the only thread:
t.join();
multi_threaded = false; // all threads that could be reading this are now done
// so again it can be safely non-atomic
You could even have multi_threaded variables for different data structures, to track whether there were multiple threads that might possibly look at a certain data structure. At that point you could think about making them atomic. Then you'd want bool nolocks = some_container.skip_locking.load(std::memory_order_relaxed); and use the same local for the whole loop.
I haven't thought this through carefully, but I think it works as long as no other thread will set some_container.skip_locking and then start another thread that accesses it; that wouldn't be safe anyway, because this thread might be in the middle of modifying a data structure without holding a lock.
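Concretely, that idea might look something like this (a sketch with hypothetical names, LockableVec and fill, none of which come from the question):

#include <atomic>
#include <mutex>
#include <vector>

struct LockableVec {
    std::atomic<bool> skip_locking{true};   // true while only one thread touches this
    std::mutex m;
    std::vector<int> data;
};

void fill(LockableVec &c, int n) {
    // Hoist the relaxed load manually and use the same local for the whole loop.
    bool nolocks = c.skip_locking.load(std::memory_order_relaxed);
    for (int i = 0; i < n; ++i) {
        if (nolocks) {
            c.data.push_back(i);            // single-threaded fast path: no locking
        } else {
            std::lock_guard lock(c.m);
            c.data.push_back(i);
        }
    }
}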
You could even treat the flag like "coarse locking" instead of "no locking" so it still works if another thread wants to start using a data structure; the time from starting a new thread to when it can actually acquire a lock for this data structure might be significant if we hold the lock across a huge number of iterations.
if (!some_container.fine_locking.load(std::memory_order_relaxed)) {
    // take a lock here, outside an inner loop
    std::lock_guard lock(mutex);
    for (int i = 0; i < n; ++i) {
        some_container.push_back(i);
    }
} else {
    // lock *inside* the loop
    for (int i = 0; i < n; ++i) {
        std::lock_guard lock(mutex);
        some_container.push_back(i);
    }
}
This could easily get pretty hairy; this is just brainstorming what's possible, not what's a good idea!