Why does the slim reader/writer exclusive lock outperform the shared one?

遥遥无期 2021-02-04 15:49

I tested the performance of the slim reader/writer lock under Windows 7 using the code from Windows Via C/C++.

The result surprised me that the exclusive lock o

3 answers
  •  说谎
    说谎 (original poster)
    2021-02-04 15:57

    This is a pretty common result for small general-purpose locks (like SRWLocks, which are only one pointer in size).

    Key Takeaway: If you have an extremely small guarded section of code, such that the overhead of the lock itself might be dominant, an exclusive lock is better to use than a shared lock.

    Also, Raymond Chen's argument about the contention on g_Value is true as well. If g_Value were read instead of written in both cases, you might notice a benefit for the shared lock.

    Details:

    The SRW lock is implemented using a single pointer-sized atomic variable which can take on a number of different states, depending on the values of the low bits. The description of the way these bits are used is out of scope for this comment--the number of state transitions is pretty high--so, I'll mention only a few states that you may be encountering in your test.

    Initial lock state: (0, ControlBits:0) -- An SRW lock starts with all bits set to 0.

    Shared state: (ShareCount: n, ControlBits: 1) -- When there is no conflicting exclusive acquire and the lock is held shared, the share count is stored directly in the lock variable.

    Exclusive state: (ShareCount: 0, ControlBits: 1) -- When there is no conflicting shared acquire or exclusive acquire, the lock has a low bit set and nothing else.

    Example contended state: (WaitPtr:ptr, ControlBits: 3) -- When there is a conflict, the threads that are waiting for the lock form a queue using data allocated on the waiting threads' stacks. The lock variable stores a pointer to the tail of the queue instead of a share count.
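The four states above can be sketched as a decoder for the lock word. This is only an illustration: the exact bit positions (`kControlMask`, `kShareShift`), the names `LockState`, `classify`, and `make_shared_word` are assumptions made for the example, not the documented Windows layout.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout, for illustration only: the low 4 bits are control bits and
// the remaining bits hold either a share count or a wait-queue pointer.
constexpr std::uintptr_t kControlMask = 0xF;
constexpr std::uintptr_t kLockedBit   = 0x1;  // lock is held (shared or exclusive)
constexpr std::uintptr_t kWaitingBit  = 0x2;  // waiters are queued behind the lock
constexpr unsigned       kShareShift  = 4;    // share count lives above the control bits

enum class LockState { Free, Exclusive, Shared, Contended };

// Build the word for "held shared by n readers, nobody waiting".
constexpr std::uintptr_t make_shared_word(std::uintptr_t n) {
    return (n << kShareShift) | kLockedBit;
}

// Classify a raw lock word into one of the four states described above.
LockState classify(std::uintptr_t word) {
    if (word & kWaitingBit) return LockState::Contended;   // payload is a queue pointer
    if (!(word & kLockedBit)) return LockState::Free;      // initial state: all zero
    return (word >> kShareShift) == 0 ? LockState::Exclusive
                                      : LockState::Shared;
}
```

Because stack addresses are aligned, the low bits of a real queue pointer are free to carry the control bits, which is what lets everything fit in one pointer-sized word.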

    In this scheme, trying to acquire an exclusive lock when you don't know the initial state is a single write to the lock word, to set the low bit and retrieve the old value (this can be done on x86 with a LOCK BTS instruction). If you succeeded (as you always will do in the 1 thread case), you can proceed into the locked region with no further operations.
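A minimal sketch of that exclusive fast path, using portable `std::atomic` rather than the real SRW implementation. The function name `try_acquire_exclusive` and the bit constant are hypothetical; `fetch_or` stands in for the LOCK BTS described above, and the compiler chooses the actual instruction.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uintptr_t kLockedBit = 0x1;  // assumed: low bit means "held"

// Try to take the lock exclusively with one unconditional atomic
// read-modify-write: set the low bit and get the previous word back.
// No prior load of the lock word is needed.
bool try_acquire_exclusive(std::atomic<std::uintptr_t>& lock) {
    const std::uintptr_t old =
        lock.fetch_or(kLockedBit, std::memory_order_acquire);
    // We own the lock only if the bit was clear before our write. If it was
    // already set, the word is unchanged and a real lock would start queueing.
    return (old & kLockedBit) == 0;
}
```

In the single-threaded test the bit is always clear on entry, so every acquire is exactly one atomic operation.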

    Trying to acquire a shared lock is a more involved operation: You need to first read the initial value of the lock variable to determine the old share count, increment the share count you read, and then write the updated value back conditionally with the LOCK CMPXCHG instruction. This is a noticeably longer chain of serially-dependent instructions, so it is slower. Also, CMPXCHG is a bit slower on many processors than the unconditional atomic instructions like LOCK BTS.
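The read, increment, conditional-write chain might look like the following sketch. The bit layout (`kShareOne` as one unit of share count above four control bits) and the name `try_acquire_shared` are assumptions for illustration; the queueing slow path is left out and simply reported as failure.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uintptr_t kLockedBit  = 0x1;   // assumed: lock is held
constexpr std::uintptr_t kWaitingBit = 0x2;   // assumed: waiters are queued
constexpr std::uintptr_t kShareOne   = 0x10;  // assumed: one reader, above 4 control bits

// Try to take the lock shared. Each step depends on the previous one:
// load -> compute new count -> compare-and-swap.
bool try_acquire_shared(std::atomic<std::uintptr_t>& lock) {
    std::uintptr_t old = lock.load(std::memory_order_relaxed);  // step 1: read
    for (;;) {
        // Held exclusively (locked, share count 0) or contended: this sketch
        // gives up where a real lock would join the wait queue.
        if ((old & kWaitingBit) || old == kLockedBit) return false;

        // Step 2: bump the share count taken from the value we just read.
        const std::uintptr_t desired = (old + kShareOne) | kLockedBit;

        // Step 3: publish only if nobody changed the word in the meantime.
        // On failure `old` is refreshed with the current word and we retry.
        if (lock.compare_exchange_weak(old, desired,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed)) {
            return true;
        }
    }
}
```

Compared with a single unconditional set-bit operation, even the uncontended path here is a load plus a compare-and-swap whose operand depends on that load.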

    It would be possible in theory to speed up the first shared acquire of the lock by assuming that the lock was in its initial state at the beginning and performing the LOCK CMPXCHG first. This would speed up the initial shared acquire of the lock (all of them in your single-threaded case), but it would pretty significantly slow down the cases where the lock is already held shared and a second shared acquire occurs.
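For comparison, here is a sketch of that hypothetical "guess the initial state" variant. It is not what the answer says Windows does; the layout constants and the name `try_acquire_shared_optimistic` are assumptions for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uintptr_t kLockedBit  = 0x1;   // assumed: lock is held
constexpr std::uintptr_t kWaitingBit = 0x2;   // assumed: waiters are queued
constexpr std::uintptr_t kShareOne   = 0x10;  // assumed: one reader, above 4 control bits

bool try_acquire_shared_optimistic(std::atomic<std::uintptr_t>& lock) {
    // Guess that the lock is in its initial state and skip the separate load.
    std::uintptr_t old = 0;
    if (lock.compare_exchange_strong(old, kShareOne | kLockedBit,
                                     std::memory_order_acquire,
                                     std::memory_order_relaxed)) {
        return true;  // guess was right: one atomic operation in total
    }

    // Guess was wrong: the failed compare-and-swap stored the real word in
    // `old`, but we have already paid for one atomic operation that did
    // nothing. Now run the ordinary compute-and-retry loop.
    for (;;) {
        if ((old & kWaitingBit) || old == kLockedBit) return false;
        const std::uintptr_t desired = (old + kShareOne) | kLockedBit;
        if (lock.compare_exchange_weak(old, desired,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed)) {
            return true;
        }
    }
}
```

The first reader wins, but every later reader of an already-shared lock performs a doomed compare-and-swap before the one that succeeds, which is the slowdown described above.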

    A similar set of divergent operations occurs when the lock is being released, so the extra cost of managing the shared state is also paid on the ReleaseSRWLockShared side.
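The release side can be sketched the same way, again covering only the uncontended cases. The layout constants and the names `release_exclusive_fast` and `release_shared_fast` are hypothetical; the real ReleaseSRWLockExclusive and ReleaseSRWLockShared also have to wake queued waiters.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uintptr_t kLockedBit  = 0x1;   // assumed: lock is held
constexpr std::uintptr_t kShareOne   = 0x10;  // assumed: one reader, above 4 control bits
constexpr unsigned       kShareShift = 4;

// Uncontended exclusive release: one unconditional atomic operation that
// clears the low bit.
void release_exclusive_fast(std::atomic<std::uintptr_t>& lock) {
    lock.fetch_and(~kLockedBit, std::memory_order_release);
}

// Uncontended shared release: read the count, decide whether we are the last
// reader, then write back conditionally.
void release_shared_fast(std::atomic<std::uintptr_t>& lock) {
    std::uintptr_t old = lock.load(std::memory_order_relaxed);
    for (;;) {
        assert((old >> kShareShift) >= 1);  // caller must hold a share
        // The last reader returns the word to its initial state; otherwise
        // only the share count goes down by one.
        const std::uintptr_t desired =
            (old >> kShareShift) == 1 ? 0 : old - kShareOne;
        if (lock.compare_exchange_weak(old, desired,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
            return;
        }
    }
}
```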
