Average latency of atomic cmpxchg instructions on Intel CPUs

独厮守ぢ · asked 2020-12-31 16:09


I am looking for a reference on the average latency of the lock cmpxchg instruction across various Intel processors, but I have not been able to locate any good reference on the topic.

5 Answers
  • 2020-12-31 16:15

    The best x86 instruction latency reference is probably the one contained in Agner Fog's optimization manuals, based on actual empirical measurements on various Intel/AMD/VIA chips and frequently updated for the latest CPUs on the market.

    Unfortunately, I don't see the CMPXCHG instruction listed in the instruction latency tables, but page 4 does state:

    Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.
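    To make the CMPXCHG semantics concrete, here is a minimal C++ sketch (my own illustration, not from the manual): on x86-64, compilers typically emit lock cmpxchg for std::atomic::compare_exchange_strong, and CMPXCHG's behavior on failure (the current memory value is loaded back into the accumulator) is mirrored by the C++ API updating 'expected'.

    ```cpp
    #include <atomic>
    #include <cassert>

    int main() {
        std::atomic<int> value{10};

        // Successful CAS: memory matches 'expected', so it is swapped to 20.
        int expected = 10;
        bool ok = value.compare_exchange_strong(expected, 20);
        assert(ok && value.load() == 20);

        // Failed CAS: memory (now 20) no longer matches 'expected' (10).
        // As with CMPXCHG, the current memory value is written back into
        // 'expected' (the EAX/RAX role), so the caller can retry.
        ok = value.compare_exchange_strong(expected, 30);
        assert(!ok && expected == 20);
    }
    ```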

  • 2020-12-31 16:17

    You can use the AIDA64 software to check instruction latencies (though you cannot choose which instructions it checks; it has a hard-coded list). People publish the results at http://instlatx64.atw.hu/

    Of the locked instructions, AIDA64 tests lock add and xchg [mem] (which is always locking, even without an explicit lock prefix).

    Here is some data. For comparison, I will also give the latencies of the following instructions:

    • xchg reg1, reg2 which is not locking;
    • add to registers and memory.

    As you can see from the tables below, by latency the locked instructions are only about 3.5 times slower on Haswell-DT (17.8c vs. 5.0c) and about 3 times slower on Kaby Lake-S (16.8c vs. 5.7c) than ordinary, non-locked read-modify-write stores to memory.

    Intel Core i5-4430, 3000 MHz (30 x 100) Haswell-DT

    LOCK ADD [m8], r8         L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
    LOCK ADD [m16], r16       L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
    LOCK ADD [m32], r32       L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
    LOCK ADD [m32 + 8], r32   L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
    LOCK ADD [m64], r64       L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
    LOCK ADD [m64 + 16], r64  L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
    
    XCHG r8, [m8]             L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
    XCHG r16, [m16]           L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
    XCHG r32, [m32]           L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
    XCHG r64, [m64]           L: 5.96ns= 17.8c  T: 7.21ns= 21.58c
    
    ADD r32, 0x04000          L: 0.22ns=  0.9c  T: 0.09ns=  0.36c
    ADD r32, 0x08000          L: 0.22ns=  0.9c  T: 0.09ns=  0.36c
    ADD r32, 0x10000          L: 0.22ns=  0.9c  T: 0.09ns=  0.36c
    ADD r32, 0x20000          L: 0.22ns=  0.9c  T: 0.08ns=  0.34c
    ADD r8, r8                L: 0.22ns=  0.9c  T: 0.05ns=  0.23c
    ADD r16, r16              L: 0.22ns=  0.9c  T: 0.07ns=  0.29c
    ADD r32, r32              L: 0.22ns=  0.9c  T: 0.05ns=  0.23c
    ADD r64, r64              L: 0.22ns=  0.9c  T: 0.07ns=  0.29c
    ADD r8, [m8]              L: 1.33ns=  5.6c  T: 0.11ns=  0.47c
    ADD r16, [m16]            L: 1.33ns=  5.6c  T: 0.11ns=  0.47c
    ADD r32, [m32]            L: 1.33ns=  5.6c  T: 0.11ns=  0.47c
    ADD r64, [m64]            L: 1.33ns=  5.6c  T: 0.11ns=  0.47c
    ADD [m8], r8              L: 1.19ns=  5.0c  T: 0.32ns=  1.33c
    ADD [m16], r16            L: 1.19ns=  5.0c  T: 0.21ns=  0.88c
    ADD [m32], r32            L: 1.19ns=  5.0c  T: 0.22ns=  0.92c
    ADD [m32 + 8], r32        L: 1.19ns=  5.0c  T: 0.22ns=  0.92c
    ADD [m64], r64            L: 1.19ns=  5.0c  T: 0.20ns=  0.85c
    ADD [m64 + 16], r64       L: 1.19ns=  5.0c  T: 0.18ns=  0.73c
    

    Intel Core i7-7700K, 4700 MHz (47 x 100) Kaby Lake-S

    LOCK ADD [m8], r8         L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
    LOCK ADD [m16], r16       L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
    LOCK ADD [m32], r32       L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
    LOCK ADD [m32 + 8], r32   L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
    LOCK ADD [m64], r64       L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
    LOCK ADD [m64 + 16], r64  L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
    
    XCHG r8, [m8]             L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
    XCHG r16, [m16]           L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
    XCHG r32, [m32]           L: 4.01ns= 16.8c  T: 5.20ns= 21.83c
    XCHG r64, [m64]           L: 4.01ns= 16.8c  T: 5.12ns= 21.50c
    
    ADD r32, 0x04000          L: 0.33ns=  1.0c  T: 0.12ns=  0.36c
    ADD r32, 0x08000          L: 0.31ns=  0.9c  T: 0.12ns=  0.37c
    ADD r32, 0x10000          L: 0.31ns=  0.9c  T: 0.12ns=  0.36c
    ADD r32, 0x20000          L: 0.31ns=  0.9c  T: 0.12ns=  0.36c
    ADD r8, r8                L: 0.31ns=  0.9c  T: 0.11ns=  0.34c
    ADD r16, r16              L: 0.31ns=  0.9c  T: 0.11ns=  0.32c
    ADD r32, r32              L: 0.31ns=  0.9c  T: 0.11ns=  0.34c
    ADD r64, r64              L: 0.31ns=  0.9c  T: 0.10ns=  0.31c
    ADD r8, [m8]              L: 1.87ns=  5.6c  T: 0.16ns=  0.47c
    ADD r16, [m16]            L: 1.87ns=  5.6c  T: 0.16ns=  0.47c
    ADD r32, [m32]            L: 1.87ns=  5.6c  T: 0.16ns=  0.47c
    ADD r64, [m64]            L: 1.87ns=  5.6c  T: 0.16ns=  0.47c
    ADD [m8], r8              L: 1.89ns=  5.7c  T: 0.33ns=  1.00c
    ADD [m16], r16            L: 1.87ns=  5.6c  T: 0.26ns=  0.78c
    ADD [m32], r32            L: 1.87ns=  5.6c  T: 0.28ns=  0.84c
    ADD [m32 + 8], r32        L: 1.89ns=  5.7c  T: 0.26ns=  0.78c
    ADD [m64], r64            L: 1.89ns=  5.7c  T: 0.33ns=  1.00c
    ADD [m64 + 16], r64       L: 1.89ns=  5.7c  T: 0.24ns=  0.73c
    
  • 2020-12-31 16:19

    There are few, if any, good references on this, because there is so much variation. It depends on basically everything: bus speed, memory speed, processor speed, processor count, surrounding instructions, memory fencing, and quite possibly the angle between the Moon and Mt Everest...

    If you have a very specific application, as in, known (fixed) hardware, operating environment, a real-time operating system and exclusive control, then maybe it will matter. In this case, benchmark. If you don't have this level of control over where your software is running, any measurements are effectively meaningless.
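    As a starting point for such a benchmark, here is a rough C++ sketch (all names and loop counts are my own; a serious measurement would pin the thread to a core, warm up, and serialize with RDTSC/CPUID, as tools like AIDA64 do). It contrasts a locked read-modify-write against a plain store, the same comparison made in the table answer:

    ```cpp
    #include <atomic>
    #include <chrono>
    #include <cstdio>

    // Rough nanoseconds-per-operation for n back-to-back calls of op.
    template <typename F>
    double ns_per_op(F op, long n = 5000000) {
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < n; ++i) op();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
    }

    int main() {
        std::atomic<long> a{0};
        // fetch_add compiles to `lock add` on x86-64; relaxed ordering
        // does not remove the LOCK prefix.
        double locked = ns_per_op([&] { a.fetch_add(1, std::memory_order_relaxed); });
        // A relaxed atomic store is a plain MOV on x86-64.
        double plain = ns_per_op([&] { a.store(1, std::memory_order_relaxed); });
        std::printf("lock add: %.2f ns/op, plain store: %.2f ns/op\n", locked, plain);
    }
    ```

    The absolute numbers are meaningless outside the machine they were measured on; the ratio between the two is what tells you how much the LOCK prefix costs in your environment.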

    As discussed in these answers, locks are implemented using CAS, so if you can get away with CAS instead of a lock (which will need at least two operations) it will be faster (noticeably? only maybe).
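    To illustrate the "at least two operations" point, a minimal test-and-set spinlock built on CAS might look like the following sketch (illustrative only: no PAUSE hint, no backoff, not production code). Acquiring takes one CAS; releasing is a second operation, a plain release store:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <thread>
    #include <vector>

    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            bool expected = false;
            // CAS false -> true to acquire; retry until the holder releases.
            while (!locked.compare_exchange_weak(expected, true,
                                                 std::memory_order_acquire)) {
                expected = false;  // the failed CAS set it to true
            }
        }
        void unlock() {
            // Second operation: a release store, no CAS needed.
            locked.store(false, std::memory_order_release);
        }
    };

    int main() {
        SpinLock lk;
        long counter = 0;  // protected by lk, so a plain long is fine
        std::vector<std::thread> ts;
        for (int i = 0; i < 4; ++i)
            ts.emplace_back([&] {
                for (int j = 0; j < 100000; ++j) {
                    lk.lock();
                    ++counter;
                    lk.unlock();
                }
            });
        for (auto& t : ts) t.join();
        assert(counter == 400000);
    }
    ```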

    The best references you will find are the Intel Software Developer's Manuals, though since there is so much variation they won't give you an actual number. They will, however, describe how to get the best performance possible. Possibly a processor datasheet (such as those here for the i7 Extreme Edition, under "Technical Documents") will give you actual numbers (or at least a range).

  • 2020-12-31 16:26

    I've been looking into exponential backoff for a few months now.

    The latency of CAS is utterly dominated by whether the instruction can operate from cache or has to operate from memory. Typically, a given memory address is being CAS'd by a number of threads (say, the entry pointer to a queue). If the most recent successful CAS was performed by a logical processor which shares a cache with the current CAS executor (L1, L2 or L3, although of course the higher levels are slower), then the instruction will operate on cache and will be fast - a few cycles. If the most recent successful CAS was performed by a logical core which does not share a cache with the current executor, then the write of the most recent CASer will have invalidated the cache line for the current executor, and a memory read is required - this will take hundreds of cycles.

    The CAS operation itself is very fast - a few cycles - the problem is memory.
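    This is why exponential backoff helps: a failed CAS means another core just won the cache line, so retrying immediately only adds coherence traffic. A hedged C++ sketch of a CAS retry loop with backoff (the function name and backoff constants are my own illustration):

    ```cpp
    #include <atomic>
    #include <thread>

    // Atomically multiply 'a' by 'factor' via a CAS retry loop,
    // backing off exponentially on contention; returns the old value.
    long fetch_multiply(std::atomic<long>& a, long factor) {
        long old = a.load(std::memory_order_relaxed);
        int spins = 1;
        while (!a.compare_exchange_weak(old, old * factor,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
            // The failed CAS reloaded the current value into 'old'.
            // Back off before retrying to reduce cache-line ping-pong.
            for (int i = 0; i < spins; ++i) std::this_thread::yield();
            if (spins < 64) spins *= 2;
        }
        return old;
    }
    ```

    Without contention the CAS succeeds on the first try and the backoff path never runs; under contention the doubling wait spreads the retries out in time.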

  • 2020-12-31 16:26

    I've been trying to benchmark CAS and DCAS in terms of NOPs.

    I have some results, but I don't trust them yet - verification is ongoing.

    Currently, on a Core i5 I see 3/5 NOPs for CAS/DCAS; on a Xeon, I see 20/22.

    These results may be completely incorrect - you were warned.
