avoid cost of std::mutex when not multi-threading?

星月不相逢 2021-02-19 09:49

Suppose I have an application that may or may not have spawned multiple threads. Is it worth it to protect operations that need synchronization conditionally with a std::mutex, or is the lock so cheap that it does not matter when single-threaded?

7 Answers
  •  终归单人心
    2021-02-19 10:26

    Uncontended locks are not too bad on modern systems, since they don't need to enter the kernel. But they still involve a full memory barrier and (or as part of) an atomic RMW operation. They're slower than a perfectly-predicted compare/branch.

    And being a function call, they defeat some optimizations, e.g. forcing the compiler to spill variables from registers back to memory, including the pointer members of a std::vector control block, introducing extra store/reload latency. (And actually the full memory barrier would defeat store-forwarding).

    (Being non-inlinable is how mutex functions actually prevent compile-time reordering on most implementations, as well as doing whatever in asm to atomically take the lock and prevent runtime reordering. This part involves draining the store buffer.)
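
    To make the earlier point about spilling concrete, here's a minimal sketch (opaque() is a hypothetical stand-in for any non-inline call such as pthread_mutex_lock): the compiler can't see through the call, so it has to assume the call might modify the vector, and it reloads the control-block pointers every iteration.

    #include <vector>

    void opaque();   // hypothetical non-inline function the optimizer can't see into

    void fill(std::vector<int> &v, int n) {
        for (int i = 0; i < n; ++i) {
            opaque();         // clobbers "all memory" as far as the optimizer knows
            v.push_back(i);   // so v's begin/end/capacity pointers get reloaded here
        }
    }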

    Depending on how much work you do and how fine-grained your locking is, the cost of an uncontended mutex can be pretty small. But if you're taking it around every vector::push_back() in a loop, you might see a speedup factor on the order of 20 for that loop.

    (Based on assumptions of one store per 2 or 3 clock cycles on average, which is reasonable assuming some memory-level parallelism and/or cache hits. A push_back loop could even be auto-vectorized and average better than 1 element per clock cycle, assuming small elements and cheap computation of values. lock cmpxchg on Skylake has 1 per 18 cycle throughput with no other memory operations in between; https://agner.org/optimize/. Other microarchitectures, including for non-x86 ISAs, will be different, but about an order of magnitude is probably a good ballpark estimate.)

    It might still be a negligible part of your total program run-time, though, and will slightly hurt the multi-thread case by doing extra loads, and another global var that has to stay hot in cache for good performance. And that global var might be in a different cache line from anything else.
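
    If you want to put a rough number on this for your own machine, a harness like the following is one way (a sketch; the names, sizes, and timing choices are mine). Time it once as-is and once with the lock_guard line commented out; build with g++ -O3 -pthread.

    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <vector>

    std::mutex m;

    int main() {
        constexpr int N = 10000000;           // 10M push_backs
        std::vector<int> v;
        v.reserve(N);                         // take reallocation out of the picture
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) {
            std::lock_guard<std::mutex> lock(m);   // comment out to time the unlocked loop
            v.push_back(i);
        }
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%lld ms, v.back() = %d\n", (long long)ms, v.back());
    }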


    If you had a bad thread/mutex library where even the uncontended case entered the kernel, you could be looking at a factor of maybe 400 speedup, or tens of thousands on a modern x86 kernel that uses microcode-assisted Spectre mitigation by flushing the branch-predictors; that takes thousands of cycles every time you enter the kernel. I'd hope there aren't any systems with a kernel modern enough to do that but still using heavy-weight locks.

    I think the mainstream OSes (Linux / Mac / Windows) all have lightweight locking that only enters the kernel as a fallback on contention. See Jeff Preshing's Always Use a Lightweight Mutex article. Probably also Solaris and *BSD.
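
    The core idea of such a lightweight mutex looks roughly like this sketch (mine, heavily simplified): the uncontended lock and unlock are a single atomic RMW / store in user-space, and only the contended path would call into the kernel.

    #include <atomic>

    struct LightMutex {                  // simplified sketch, not production code
        std::atomic<int> state{0};       // 0 = unlocked, 1 = locked

        void lock() {
            int expected = 0;
            while (!state.compare_exchange_weak(expected, 1,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed)) {
                expected = 0;
                // contended: a real implementation would sleep in the kernel here
                // (e.g. a futex wait on Linux) instead of spinning
            }
        }
        void unlock() {
            state.store(0, std::memory_order_release);
            // a real implementation would wake one waiter here, if any
        }
    };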

    (Cost to enter the kernel at all with syscall on Skylake x86: ~100 to 150 cycles or so, IIRC. With Spectre/Meltdown mitigations on x86, you change page tables on entry and exit (expensive, and potentially leading to TLB misses / page walks) and maybe use a special asm instruction to flush branch prediction.)

    A system call is also essentially serializing; in a tight user-space loop, it doesn't leave much for out-of-order exec to look at. And there's at least some work within the kernel. (It also destroys any memory-level parallelism you could have had across loop iterations, but a full barrier from a mutex lock already does that.)

    So if for some reason you care about bad implementations with very expensive locks even in the uncontended case, you very likely want this. (And probably want the multi-threaded case to be less fine-grained). But such implementations are hopefully not widespread. GNU/Linux is definitely not like this, and AFAIK nothing important is either.


    gcc's libstdc++ already sort of does this optimization, checking __gthread_active_p () inside mutex lock/unlock (e.g. __gthread_mutex_lock in /usr/include/c++/9.1.0/x86_64-pc-linux-gnu/bits/gthr-default.h), doing nothing if false. And this is in a header, so the wrapper around pthread_mutex_lock can inline into your code.

    On GNU/Linux (glibc) it works by checking if you built with g++ -pthread or not. (Checking if the (dynamic) linker gave us a non-zero address for a libpthread private function symbol name, using weak alias stuff. Since this condition is a link-time constant, it doesn't even need to be atomic<> so the compiler can keep the result in a register. It's basically just a load of a non-atomic void*.) libstdc++ on other OSes (not glibc) has other strategies for checking, see the other definitions.
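
    The mechanism is roughly the following sketch (simplified; the real gthr-default.h uses weakref aliases, and on glibc 2.34+ the pthread symbols live in libc itself so the check always finds them):

    // weak declaration: the address is null unless libpthread got linked in
    extern "C" int __pthread_key_create(unsigned *, void (*)(void *)) __attribute__((weak));

    static inline bool gthread_active() {
        // a link-time constant: the compiler can keep this in a register or hoist it
        return &__pthread_key_create != nullptr;
    }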

    Mehrdad's test-case runs fast even for the Unconditional case, when built without -pthread. ~727ms for the 1000M iterations on Arch GNU/Linux, g++9.1 -O3, glibc 2.29-4, i7-6700k (Skylake) at ~4.2GHz (turbo) with echo performance > energy_performance_preference. That's almost exactly 3 clock cycles per iteration, bottlenecked on the 3 cycle loop-carried dependency chain through total1. (I bumped up the iteration count from Mehrdad's original instead of using higher-precision timing / printing, partly to hide startup overhead and max-turbo ramp up.)

    But with g++ -O3 -pthread so glibc's pthread_mutex_lock and unlock do get called, it's about 18 times slower on Skylake. About 13000ms on my machine, which is about 54 clock cycles / iteration.

    The test-case doesn't do any memory access inside the critical section, just total = ((total << 1) ^ i) + ((total >> 1) & i) on a local unsigned int total, which the compiler can keep in a register across the mutex function calls. So the only stores that the lock cmpxchg (lock) and lock dec (unlock) have to drain from the store buffer are the plain stores to other mutex fields, and the return address pushed on the stack by x86's call instruction. This should be somewhat similar to a loop doing .push_back(i) on a std::vector.

    Per Agner Fog's testing, those locked instructions alone, with no other memory access, would account for 36 cycles of throughput cost. The actual 54 cycles/iter shows that other work in the lock/unlock functions, and waiting for other stores to flush, has a cost. (Out-of-order exec can overlap the actual total = ... calculation with all this; we know that locked instructions don't block out-of-order exec of independent ALU instructions on Skylake. Although mfence does, because of a microcode update to fix an erratum, making gcc's mov+mfence strategy for seq-cst stores, instead of xchg like other compilers, even worse.)
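
    Reconstructed from that description, the benchmark loop is roughly the following (my reconstruction, not Mehrdad's exact code; build with g++ -O3, with and without -pthread):

    #include <cstdio>
    #include <mutex>

    std::mutex m;

    int main() {
        unsigned int total = 0;
        for (unsigned int i = 0; i < 1000000000; ++i) {   // "1000M" iterations
            std::lock_guard<std::mutex> lock(m);          // lock + unlock every iteration
            total = ((total << 1) ^ i) + ((total >> 1) & i);
        }
        std::printf("%u\n", total);                       // keep total live
    }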


    Footnote 1: At -O3, GCC hoists the if(__gthread_active_p ()) out of the loop, making two versions of the loop. (This is measurably faster than having 3 taken branches inside the loop, including the loop branch itself.)

    The "Conditional" version includes a useless load of single_threaded into a register that gets overwritten right away, because nothing happens based on the test. (Compilers don't optimize atomics at all, like volatile, so even an unused load stays. But fortunately x86-64 doesn't need any extra barrier instructions for seq_cst loads so it barely costs anything. Still, over 10 back-to-back runs: Conditional: 728ms pretty consistently. Unconditional: 727ms pretty consistently. vs. a calculated 716ms for 3 cycles/iter at a measured average of 4.19GHz user-space cycles/sec under perf stat -r10 ./a.out.

    But at -O2, the branches on __gthread_active_p stay inside the loop:

    • Conditional: 730 to 750 ms (less stable from run to run than before) with 2 branches per iteration.
    • Unconditional (no pthread): ~995 ms with 3 taken branches per iteration. The branch mispredict rate is still 0.00%, but taken branches do have a cost for the front-end.
    • Unconditional (with pthread): ~13100 ms (up from 13000 for -O3 unconditional)

    If you compile with gcc -O2, or even at -O3 if the compiler decides not to hoist the if out of the loop (loop unswitching / loop-multiversioning), you'll get asm like this:

    # g++ 9.1 -O2 for x86-64 on Arch GNU/Linux
    
        # early in the function, before any loops: load a symbol address into a register
        10de:       48 8b 2d f3 2e 00 00    mov    rbp,QWORD PTR [rip+0x2ef3]        # 3fd8 <__pthread_key_create@GLIBC_2.2.5>
         ...
    # "Unconditional" inner loop
        11b8:       48 85 ed                test   rbp,rbp           # do{
        11bb:       74 10                   je     11cd   # if( __gthread_active_p () )
          11bd:       4c 89 ef                mov    rdi,r13   # pass a pointer to the mutex in RDI
          11c0:       e8 bb fe ff ff          call   1080 <pthread_mutex_lock@plt>
          11c5:       85 c0                   test   eax,eax
          11c7:       0f 85 f1 00 00 00       jne    12be   # if non-zero retval: jump to a call std::__throw_system_error( eax ) block
        11cd:       43 8d 04 24             lea    eax,[r12+r12*1]    # total<<1 = total+total
        11d1:       41 d1 ec                shr    r12d,1             # shifts in parallel
        11d4:       31 d8                   xor    eax,ebx
        11d6:       41 21 dc                and    r12d,ebx           # xor, and with i
        11d9:       41 01 c4                add    r12d,eax           # add the results: 3 cycle latency from r12 -> r12 assuming perfect scheduling
        11dc:       48 85 ed                test   rbp,rbp
        11df:       74 08                   je     11e9   # conditional skip mov/call
          11e1:       4c 89 ef                mov    rdi,r13
          11e4:       e8 77 fe ff ff          call   1060 <pthread_mutex_unlock@plt>
        11e9:       83 c3 01                add    ebx,0x1
        11ec:       81 fb 80 96 98 00       cmp    ebx,0x989680
        11f2:       75 c4                   jne    11b8   # }while(i<10000000)
    

    I can't repro this code-gen on Godbolt with g++, or with clang using libc++: https://godbolt.org/z/kWQ9Rn. Maybe Godbolt's install of libstdc++ doesn't have the same macro defs as a proper install?

    call __gthrw_pthread_mutex_lock(pthread_mutex_t*) isn't inlining so we can't see the effect of the if (!__gthread_active_p ()) check.


    Make your check efficient if you do this

    If you're the only thread running, that won't change unless your loop starts threads.

    You can make the variable non-atomic. Set it right before you start any threads, then never write it again. All threads can then just read it into a register across loop iterations. And compilers can even hoist the check out of loops for you. (Like gcc -O3 does for the branch inside the GCC mutex implementation as described above, but not at -O2).
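
    For example, even the naive version with the check inside the loop is cheap, and GCC -O3 can unswitch it for you (a sketch using the same multi_threaded flag, mutex, and stuff() placeholder as the full example below):

    for (int i = 0; i < n; ++i) {
        if (multi_threaded) {                 // plain non-atomic bool: loop-invariant,
            std::lock_guard lock(mutex);      // so the compiler can hoist the branch
            stuff();
        } else {
            stuff();
        }
    }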

    You can manually hoist the check out of a loop, instead of relying on the compiler to branch on a loop-invariant register value after hoisting the load of the non-atomic variable. If manually hoisting helps your compiler make a loop significantly faster, might as well go all-in on this optimization:

    // global scope
    bool multi_threaded = false;   // zero init lets this go in the BSS

    // in a function
    if (!multi_threaded) {
        // optionally take one coarse lock here, outside the inner loop:
        // std::lock_guard lock(mutex);
        for (int i = 0; i < n; ++i) {
            stuff();
        }
    } else {
        for (int i = 0; i < n; ++i) {
            std::lock_guard lock(mutex);
            stuff();
        }
    }


    Pull the loop body out into a function to avoid duplication if it's more than trivial.
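
    For example (a sketch; the names are mine), passing the body as a callable keeps it in one place:

    template <class Body>
    void run_loop(int n, Body body) {        // same multi_threaded / mutex globals as above
        if (!multi_threaded) {
            for (int i = 0; i < n; ++i)
                body(i);                     // no locking at all
        } else {
            for (int i = 0; i < n; ++i) {
                std::lock_guard lock(mutex); // fine-grained locking
                body(i);
            }
        }
    }

    // usage:  run_loop(n, [&](int i) { stuff(); });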

    // starting threads
    multi_threaded = true;
    std::thread t(stuff);
    

    If you ever want to return to single-threaded mode, you can do that safely at some point when you know you're the only thread:

    t.join();
    multi_threaded = false;    // all threads that could be reading this are now done
                               // so again it can be safely non-atomic
    

    You could even have multi_threaded variables for different data structures, to track whether there were multiple threads that might possibly look at a certain data structure. At that point you could think about making them atomic. Then you'd want bool nolocks = some_container.skip_locking.load(std::memory_order_relaxed); and use the same local for the whole loop.

    I haven't thought this through carefully, but I think that works as long as no other thread will set some_container.skip_locking and start another thread that accesses it; that wouldn't be safe anyway because this thread might be in the middle of modifying a data structure without holding a lock.

    You could even treat the flag like "coarse locking" instead of "no locking" so it still works if another thread wants to start using a data structure; the time from starting a new thread to when it can actually acquire a lock for this data structure might be significant if we hold the lock across a huge number of iterations.

    if (!some_container.fine_locking.load(std::memory_order_relaxed)) {
        // take a lock here, outside the inner loop
        std::lock_guard lock(mutex);
        for (int i = 0; i < n; ++i) {
            some_container.push_back(i);
        }
    } else {
        // lock *inside* the loop
        for (int i = 0; i < n; ++i) {
            std::lock_guard lock(mutex);
            some_container.push_back(i);
        }
    }
    

    This could easily get pretty hairy; this is just brainstorming what's possible, not necessarily what's a good idea!
