Why does this difference in asm matter for performance (in an un-optimized ptr++ vs. ++ptr loop)?


TL;DR: the first loop runs ~18% faster on a Haswell CPU. Why? The loops are from gcc -O0 (un-optimized) output, using ptr++ vs. ++ptr.
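The question's C source isn't shown here, but a plausible shape for the two loops is sketched below (a hypothetical reconstruction, consistent with the answer's mention of a memory-resident loop counter and ~4MiB of memory touched; the names and stored value are my assumptions):

    #include <stddef.h>

    #define N (4u * 1024 * 1024 / sizeof(int))   // ~4MiB, per the answer below

    void first_loop(int *ptr) {                  // ptr++ version
        for (size_t i = 0; i < N; i++)
            *(ptr++) = 0;                        // use the old value, then bump
    }

    void second_loop(int *ptr) {                 // ++ptr version
        for (size_t i = 0; i < N; i++)
            *(++ptr) = 0;                        // bump first (writes elements 1..N)
    }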

1 Answer
  • 2021-01-17 12:12

    First of all, performance analysis on -O0 compiler output is usually not very interesting or useful.


    Is it correct that the LEAL operation for computing the address is much faster than the ADDL (+4) method? Is this the reason for the performance difference?

    Nope, add can run on every ALU execution port on any x86 CPU. lea usually has equally low latency with simple addressing modes, but worse throughput. On Atom, it runs in a different stage of the pipeline from normal ALU instructions, because it actually lives up to its name and uses the AGU on that in-order microarchitecture.
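    As a concrete illustration (my example, not from the question): lea earns its keep when address math must not clobber the source register. gcc -O2 typically compiles this to a single leaq 4(%rdi), %rax plus ret on x86-64:

    // next.c: try  gcc -O2 -S next.c
    int *next(int *p) {
        return p + 1;    // pure address arithmetic, no memory access
    }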

    See the x86 tag wiki to learn what makes code slow or fast on different microarchitectures, esp. Agner Fog's microarchitecture pdf and instruction tables.

    add is only worse here because gcc -O0 uses it with a memory destination and then loads from that, producing even worse code.


    Compiling with -O0 doesn't even try to use the best instructions for the job. e.g. you'll get mov $0, %eax instead of the xor %eax,%eax you always get in optimized code. You shouldn't infer anything about what's good from looking at un-optimized compiler output.
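    For instance (a trivial example of mine): gcc -O0 compiles this to movl $0, %eax, while any optimized build uses the shorter, dependency-breaking xorl %eax, %eax:

    // zero.c: compare  gcc -O0 -S zero.c  against  gcc -O2 -S zero.c
    int zero(void) {
        return 0;
    }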

    -O0 code is always full of bottlenecks, usually on load/store or store-forwarding. Unfortunately IACA doesn't account for store-forwarding latency, so it doesn't realize that these loops actually bottleneck on the ~6 cycle store-forwarding latency of the loop counter's loop-carried dependency chain.


    As far as I know, once a new address is calculated, some clock cycles must elapse before the memory can be referenced, so the second loop, after the addl $4, -12(%ebp), needs to wait a little before proceeding,

    Yes, the mov load of -12(%ebp) won't be ready for about 6 cycles after the load that was part of add's read-modify-write.

    whereas in the first loop we can immediately refer to the memory

    Yes

    and in the meanwhile LEAL will compute the next address

    No.

    Your analysis is close, but you missed the fact that the next iteration still has to load the value we stored into -12(%ebp). So the loop-carried dependency chain is the same length, and the next iteration's lea can't actually start any sooner than in the loop using add.
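    You can isolate that store-forwarding round trip even in optimized builds (a sketch of mine, not the OP's test): a volatile variable is forced to live in memory, like every local at -O0, so each iteration's load has to wait for the previous iteration's store to forward:

    #include <stdio.h>

    int main(void) {
        volatile int x = 0;                // lives in memory, like a -O0 local
        for (int i = 0; i < 100000000; i++)
            x += 4;                        // load -> add -> store, loop-carried
        printf("%d\n", x);                 // expect roughly ~6 cycles per iteration
        return 0;
    }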


    The latency issues might not be the loop throughput bottleneck:

    uop / execution port throughput needs to be considered. In this case, the OP's testing shows it's actually relevant. (Or latency from resource conflicts.)

    When gcc -O0 implements ptr++, it keeps the old value in a register, like you said. So store addresses are known further ahead of time, and there's one fewer load uop that needs an AGU.

    Assuming an Intel SnB-family CPU:

    ## ptr++: 1st loop
    movl    -12(%ebp), %eax   //1 uop (load)
    leal    4(%eax), %edx     //1 uop (ALU only)
    movl    %edx, -12(%ebp)   //1 store-address, 1 store-data
    //   no load from -12(%ebp) into %eax
    ... rest the same.
    
    
    ## ++ptr:  2nd loop
    addl    $4, -12(%ebp)       // read-modify-write: 2 fused-domain uops.  4 unfused: 1 load + 1 ALU add + 1 store-address + 1 store-data
    movl    -12(%ebp), %eax     // load: 1 uop.   ~6 cycle latency for %eax to be ready
    ... rest the same.
    

    So the pointer-increment part of the 2nd loop has one more load uop. Probably the code bottlenecks on AGU throughput (address-generation units). IACA says that's the case for arch=SNB, but that HSW bottlenecks on store-data throughput (not AGUs).

    However, without taking store-forwarding latency into account, IACA says the first loop can run at one iteration per 3.5 cycles, vs. one per 4 cycles for the second loop. That's faster than the 6 cycle loop-carried dependency through the addl $1, -48(%ebp) loop counter, which indicates that latency caps the loop below max AGU throughput. (Resource conflicts probably mean it actually runs even slower than one iteration per 6c; see below.)
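    (For reference, IACA analyzes a region you mark with macros from Intel's iacaMarks.h; a minimal usage sketch with a placeholder loop body is below. Note that a binary with the markers is for analysis only, not for running.)

    #include "iacaMarks.h"        // ships with Intel's IACA download

    void loop_under_test(int *ptr, int n) {
        for (int i = 0; i < n; i++) {
            IACA_START            // begin the analyzed region at the top of the loop body
            *(ptr++) = i;         // placeholder body
        }
        IACA_END                  // end the analyzed region, after the loop
    }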

    We could test this theory:

    Adding an extra load uop to the lea version, off the critical path, would consume more throughput but wouldn't lengthen any of the loop's latency chains. e.g.

    movl    -12(%ebp), %eax   //Get the current address
    leal    4(%eax), %edx     //Calculate the next address
    movl    %edx, -12(%ebp)   //Store the new (next) address
    
    mov     -12(%ebp), %edx   //Extra load: off the critical path, result never used
    

    %edx is about to be overwritten by a mov, so nothing depends on the result of this load. (The destination of mov is write-only, so register renaming lets it break dependency chains.)

    So this extra load would bring the lea loop up to the same number and flavour of uops as the add loop, but with different latency. If the extra load has no effect on speed, we know the first loop isn't bottlenecked on load / store throughput.
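    (A minimal timing harness for that kind of experiment, my sketch rather than the OP's benchmark; it assumes POSIX clock_gettime and the loop under test compiled at -O0 in a separate file as run_loop():)

    #include <stdio.h>
    #include <time.h>

    void run_loop(void);          // the -O0 loop under test, linked separately

    int main(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int rep = 0; rep < 100; rep++)   // repeat to average out noise
            run_loop();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("total %.3f s, %.5f s per run\n", s, s / 100);
        return 0;
    }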


    Update: OP's testing confirmed that an extra unused load slows the lea loop down to about the same speed as the add loop.

    Why extra uops matter when we're not hitting execution port throughput bottlenecks

    uops are scheduled in oldest-first order (out of uops that have their operands ready), not in critical-path-first order. Extra uops that could have been done in a spare cycle later on will actually delay uops that are on the critical path (e.g. part of the loop-carried dependency). This is called a resource conflict, and can increase the latency of the critical path.

    i.e. instead of waiting for a cycle where critical-path latency left a load port with nothing to do, the unused load will run when it's the oldest load with its load-address ready. This will delay other loads.

    Similarly, in the add loop, where the extra load is part of the critical path, it causes more resource conflicts that delay critical-path operations.


    Other guesses:

    So maybe having the store address ready sooner is what's doing it, letting memory operations pipeline better. (e.g. TLB-miss page walks can start sooner when approaching a page boundary. Even normal hardware prefetching doesn't cross page boundaries, even if the pages are hot in the TLB. The loop touches 4MiB of memory, which is enough for this kind of thing to matter. L3 latency is high enough to maybe create a pipeline bubble; and if your L3 is small, main memory latency certainly is.)

    Or maybe the extra latency just makes it harder for out-of-order execution to do a good job.
