false sharing in boost::detail::spinlock_pool?

后端 未结 1 1214
南旧
南旧 2021-02-01 22:22

I came across this SO question and reading it over eventually led me to look at boost::detail::spinlock_pool.

The purpose of boost::detail::spinlock_p

1条回答
  •  南笙
    南笙 (楼主)
    2021-02-01 22:56

    BRIEF

    You are correct, in that the spinlocks are sharing the same cache line, when they could possibly be in different cache lines. Aka false sharing. And there may well be some performance gain to be had by allocating the locks in different cache lines (but see below).

    However, this is NOT the same as lock contention. Lock contention arises when a lock is held by one guy, and one or more other guys want to access it.

    I.e. spinlock_pool has cache line contention caused by locks co-resident in the same cache line. But it has (reduced) software lock contention.

    The reduced software lock contention is almost undoubtedy good.

    The cache line contention probably hurts a bit, as your benchmarks show to some degree, but is a second order effect compared to the software lock contention.

    DETAIL

    Background

    First, some background:

    Classic spinloops are test-and-test-and-set:

        loop: 
             tmp := load( memory_location_of_sw_lock )
             if( is_locked(tmp) ) goto loop
             was_locked := atomic_hw_locked_rmw( memory_location_of_sw_lock, 
                                                 value_that_will_get_sw_lock )
             if( was_locked(tmp) ) goto loop
         got_the_sw_lock:
              ...
              ... critical section
              ...
              // release the lock, e.g.
              memory_location_of_sw_lock := 0
    

    There are also test-and-set spinlocks, which look like

        loop: 
             was_locked := atomic_hw_locked_rmw( memory_location_of_sw_lock, 
                                                 value_that_will_get_sw_lock )
             if( was_locked(tmp) ) goto loop
    

    These can have very bad performance on most modern memory systems with caches, write-through or write-back. (Although some of the hardware optimizations that I have proposed make test-and-set spinloops as fast as test-and-test-and-set spinloops - slightly faster because they are smaller.)

    Note that there are two different concepts of lock here: the "software" lock that the spinlock is acquiring, and the hardware lock that is used by the atomic_hw_locked_rmw instruction, such as Intel LOCK INC mem or CMPXCHG. We don;t care about the latter much, except to know that it usually unconditionally writes to the cache line holding the software lock, invalidated other copies of the cache line. (Making the write conditional is another possible hardware optimization.)

    O(N^2) burst of cache misses on (software) lock contention

    Lock contention with test-and-test-and-set spinloops is particularly bad. The waiters all spin on the lock, and when it is released there is a burst of bus accesses. One guy wins, the others realize that they have lost, and eventually they settle down to spin again. This burst of activity is particularly bad because for N waiting guys (threads/processes/processors) the burst of bus activity can be O(N^2) in size, because in the worst case everyone exits the test- part of a test-and-test-and-set spinloop, and everyone tries to do the atomic locked RMW (read-modify-write) instruction, like x86 LOCK INC mem or CMPXCHG, at the same time. This means that everyone will eventually write the line, even though all but the first don't need to write the lock because they won't get the lock.

    E.g.

    Lock is held by P0
    P1-PN are spinning in test-and-test-and-set spinloops waiting for it.
    
    P0 releases the lock, e.g. by writing it to 0
    
    P1's "test-" instruction reads the lock
    ...
    PN's "test-" instruction reads the lock
    
    All of P1-PN see the lock as released, 
       so they fall out of the "test-" part of the spinloop which uses ordinary instructions
       to the test-and-set part of the spinloop, which uses a hardware atomic RMW like LOCK INC mem
    
    P1's locked RMW happens, and acquires the spinlock for P1
       It invalidates all other cache lines
       P1 goes away and does something with the data protected by the lock
    
    P2's locked RMW fails to acquire the spinlock
       It invalidates all other caches because it has a write
       P1 falls back into its test- spinloop
    
    P3's locked RMW fails
       It invalidates all other caches because it has a write
       P1 falls back into its test- spinloop
    
    ...
    
    PN's locked RMW fails
    

    and now, at the very least, all of the remaining P2..PN processors must do ordinary unlocked cache misses for their test- spinloop. This implies at least N+(N-1) cache misses. It can be considerably worse, because it is possible for each write, by a waiter who will fail to acquire the lock, to trigger all of the other waiters to do an unlocked read. I.e., depending on timing, you may get

       1 release the lock
       N reads to test- the lock
       1 locked atomic RMW to acquire the lock
       N reads...
       for each of the remaining N-1 processors
          1 locked RMW that fails to acquire the lock
          N reads 
    

    which is O(N^2). Or possibly, for processor M, 1 locked RMW, followed by 2..M reads -- which is still O(N^2).

    What does this mean for this question?

    OK, if there was true lock contention you get this O(N^2) burst of cache misses when a contended lock is released.

    However, the spinlock_pool spreads the waiters out across multiple locks. If there are S locks in the spinlock pool, you get far fewer waiters: probably fewer than N/S (because contention tends to be superlinear in the number of people sharing the same lock).

    I.e. with Boost's spinlock_pool, you might naively expect to get 1/41-th the amount of contention --- and probably less than that.

    Remember, spinlock_pool is a compromise between having a lock per shared_pointer, increasing the size of the shared pointers, and having all of the shared_pointers share the same lock. So any contention on a shared_pointer's spinlock may be (a) true contention, or (b) false contention caused by shared_pointers that are independent hashing to the same entry in the spinlock_pool.

    Now, yes, if instead of having N waiters you have "only" N/41 waiters, then the burst is still O((N/41)^2) or O(N^2). But, if N typically less than 41... you get the idea.

    Basically, hashing to spread the shared_Pointers out over multiple spinlock_pool entries quickly reduces the amount of contention.

    But... the spinlocks live in the same cacheline? Right... but waiters on other cache lines will not advance to the test-and-set part of their spinloop.

    I.e. if a contended lock with M waiters shares a cache line with M other processes, you will get M*N traffic. But if the hashing has reduced M to O(1), you only get N traffic.

    And if most of the time there are no other waiters at all, then you get only O(1) raffic on a lock release.

    BOTTOM LINE:

    Reducing software lock contention is much higher benefit performance-wise than reducing cache false sharing that causes hardware cache line contention.

    Now, there still may be benefit to be hard in not packing so many spinlock_pool entries into the same cache line. But this is by no means obvious; it is an empirical question, by which I mean that you need to run an experiment, and it will vary with workload.

    Sometimes false sharing of these locks in the same cache line will be bad. A classic example is the locks that protect processor run queues.

    Sometimes false sharing of these locks in the same cache line is good - it can provide the same benefits that prefetch does in a cache. E.g. imagine that you have allocated an array of shared_pointers, and that people usually access aosptr[i] and then aosptr[i+1], etc.

    It all depends on your workload. I've seen it fall both ways. Usually in my experience one lock per cache line is better, although often not by much.

    MORE FUN

    In case you care, hardware optimization for test-and-test-and-set spinloops was the subject of my MS thesis, "A Feature Taxonomy of Synchronization Primitives, including the Bus Abandonment Lock" - a hardware update cache protocol that eliminated the burst of bus accesses. Not published in any formal publication, except for University Microfiche.

    My work was inspired by the paper "The performance of spin lock alternatives for shared-memory multiprocessors" by T Anderson - IEEE Transactions on Parallel and Distributed Systems , 1990.

    I think it's fair to say that most publications have been on software, algorithms, including the famous MCS lock. I think it is also fair to say that such techniques, while popular in theory, have been little used by journeymen programmers.

    By the way, there are some more hardware optimizations possible here. E.g. CMPXCHG need not write the lock at all. Unfortunately, on current Intel hardware, or at least on the Intel hardware I designed in 1991, which I still think is used, the only way to release the hardware lock (that is used to implement the atomic locked RMW) is to use a special microcode write "store-unlock". Heck, the microcode could even use LoadLinked/StoreConditional (LL/SC), falling back to locking in those situations where a naive LL/SC makes no forward progress on some threads.

    It is possible that Intel has fixed this recently. I left Intel in 2009. I was trying to get it fixed, improved, optimized, oh, since 1991. And Intel has been improving the performance of locks a lot recently. But I think they have mainly been worked on uncontended lock performance, and have not been optimizing contended lock performance.

    Similarly, Alain Kagi, in his dissertation and in some publications, as well as patent http://www.google.com/patents/US6460124, shows that by adding judicious delays the cache hardware can make spinlocks as efficient as queued locks.

    Some of these hardware optimizations make test-and-set spinloops better than test-and-test-and-set spinloops.

    The most recent development has been Intel TSX (Transactional Synchronization Extensions), consisting of HLE (Hardware Lock Elision (Ravi Rajwar worked on this at UWisc while I and Alain Kagi were there, although my work in synchronization was earlier at UIUC)) and RTM (Restricted Transactional Memory). Both of these help uncontended .. hmm, they help with contended locks that are falsely contending, e.g. a coarse grain lock protecting what could be independent stuff. I.e. false lock contention. In some ways, HLE may make spinlock_pool unnecessary.

    APOLOGY

    I am sorry: I have provided a long winded answer that boils down to "reducing software lock contention can be much more important than hardware cache line contention for spinloops".

    While I would expect some performance to be gained from allocating 1 spinloop per cache line, it may be small, and on some not-exactly-uncommon workloads there may even be a loss.

    And to be sure, you would have to measure it.

    But, in any case, the big gain comes from reducing software lock contention.

    0 讨论(0)
提交回复
热议问题