Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?

梦谈多话 2020-11-28 15:01

ADC on Haswell and earlier is normally 2 uops, with 2 cycle latency, because Intel uops traditionally could only have 2 inputs (https://agner.org/optimize/). Broadwell / Sk

2 Answers
  • 2020-11-28 15:47

    According to my microbenchmarks, the results of which can be found on uops.info, this optimization was introduced with Sandy Bridge (http://uops.info/html-tp/SNB/ADC_R64_I8-Measurements.html). Westmere does not do this optimization (http://uops.info/html-tp/WSM/ADC_R64_I8-Measurements.html). The data was obtained using a Core i7-2600 (Sandy Bridge) and a Core i5-650 (Westmere).

    Furthermore, the data on uops.info shows that the optimization is not performed if an 8-bit register is used (Sandy Bridge, Ivy Bridge, Haswell).

  • 2020-11-28 16:03

    It's not present on Nehalem, but is on IvyBridge. So it was new either in Sandybridge or IvB.

    My guess is Sandybridge, because that was a major redesign of the decoders: they produce up to 4 total uops (rather than patterns like 4+1+1+1 that were possible in Core2 / Nehalem), and they hang on to an instruction that can macro-fuse (like add or sub) if it's the last in a group, in case the next instruction is a jcc.

    Significantly for this, I think SnB decoders also look at the imm8 in immediate-count shifts to check if it's zero, instead of only doing that in the execution units (see footnote 2).

    Hard data so far:

    • Broadwell and later (and AMD, and Silvermont/KNL) don't need this optimization: adc r,imm and adc r,r are always 1 uop, except for the AL/AX/EAX/RAX imm short form on Broadwell/Skylake (see footnote 1).
    • Haswell does this optimization: adc reg,0 is 1 uop, adc reg,1 is 2. This applies to 32 and 64-bit operand-size, but not 8-bit.
    • IvyBridge i7-3630QM does this optimization (thanks @DavidWohlferd).
    • Sandybridge ???
    • Nehalem i7-820QM does not, adc is slower than add regardless of the imm.
    • Core 2 E6600 (Conroe/Merom) doesn't either.
    • Safe to assume Pentium M and earlier don't.
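
    As a compact restatement of the data points above, here is a minimal Python sketch. The function name `adc_imm_uops`, the table names, and the microarchitecture spellings are made up for illustration; the function returns None wherever this list gives no measurement (Sandybridge here, 16-bit operand-size, and the short form before Broadwell), rather than guessing.

```python
# Hypothetical summary table (names are illustrative, not from any tool).
# True/False: does "adc reg, 0" get the single-uop special case?
# None: not measured in the list above.
ZERO_IMM_SPECIAL_CASE = {
    "Core2": False,
    "Nehalem": False,
    "SandyBridge": None,
    "IvyBridge": True,
    "Haswell": True,
}

# Microarchitectures named above where adc r,imm is 1 uop regardless of imm.
ALWAYS_SINGLE_UOP = {"Broadwell", "Skylake"}


def adc_imm_uops(uarch, imm, operand_bits=64, short_form=False):
    """Return the uop count for adc reg, imm, or None if the list doesn't say.

    short_form=True means the accumulator-only encoding with no ModR/M byte.
    """
    if operand_bits not in (8, 16, 32, 64):
        raise ValueError("operand_bits must be 8, 16, 32 or 64")
    if uarch in ALWAYS_SINGLE_UOP:
        # Footnote 1: the short form still decodes to 2 uops, even for imm = 0.
        return 2 if short_form else 1
    if uarch not in ZERO_IMM_SPECIAL_CASE:
        raise ValueError(f"no data for {uarch!r}")
    special = ZERO_IMM_SPECIAL_CASE[uarch]
    if special is None or short_form:
        return None  # not covered by the data points above
    if special and imm == 0:
        if operand_bits in (32, 64):
            return 1
        if operand_bits == 16:
            return None  # the list only mentions 8, 32 and 64-bit
    return 2  # the normal 2-uop adc on Haswell and earlier


if __name__ == "__main__":
    for args in [("Haswell", 0), ("Haswell", 1), ("Nehalem", 0), ("Skylake", 12345)]:
        print(args, "->", adc_imm_uops(*args))
```

    The other answer's uops.info measurements would turn the Sandybridge entry into True; it is left as None here only to mirror this list as written.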

    Footnote 1: On Skylake, the al/ax/eax/rax, imm8/16/32/32 short-form encodings with no ModR/M byte still decode to 2 uops, even when the immediate is zero. For example, adc eax, strict dword 0 (15 00 00 00 00) is twice as slow as the imm8 encoding of adc eax, 0 (83 d0 00). Both uops are on the critical path for latency.

    Looks like Intel forgot to update the decoding for the other immediate forms of adc and sbb! (This all applies equally to both ADC and SBB.)

    Assemblers will use the short-form by default for immediates that don't fit in an imm8, so for example adc rax, 12345 assembles to 48 15 39 30 00 00 instead of the one-byte larger single-uop form that is the only option for registers other than the accumulator.

    A loop that bottlenecks on the latency of adc rcx, 12345 runs twice as fast as the same loop using RAX. But adc rax, 123 is unaffected, because it uses the adc r/m64, imm8 encoding which is single uop.
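
    To make the three encodings concrete, here is a minimal Python sketch that picks an encoding for adc reg, imm the way this footnote describes and reproduces the byte sequences quoted above. The helper name `encode_adc_imm` and its `strict_dword` parameter are hypothetical, it is not how any particular assembler is implemented, and it only handles 32/64-bit registers with an immediate that fits in a signed 32-bit value.

```python
import struct

# Register numbering used by the ModR/M and REX fields.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"] + [f"r{i}" for i in range(8, 16)]
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"] + [f"r{i}d" for i in range(8, 16)]


def encode_adc_imm(reg, imm, strict_dword=False):
    """Encode adc reg, imm for a 32- or 64-bit register (illustrative helper).

    strict_dword=True forces a 4-byte immediate even if imm fits in an imm8.
    """
    if reg in REG64:
        num, wide = REG64.index(reg), True
    elif reg in REG32:
        num, wide = REG32.index(reg), False
    else:
        raise ValueError(f"unsupported register {reg!r}")
    if not -2**31 <= imm < 2**31:
        raise ValueError("immediate must fit in a signed 32-bit value")

    rex = (0x48 if wide else 0x40) | (num >> 3)   # REX.W for 64-bit, REX.B for r8..r15
    prefix = bytes([rex]) if rex != 0x40 else b""
    modrm = 0xC0 | (2 << 3) | (num & 7)           # mod=11, /2 selects ADC, rm=register

    if -128 <= imm <= 127 and not strict_dword:
        # adc r/m, imm8 (opcode 83 /2 ib): sign-extended 8-bit immediate
        return prefix + bytes([0x83, modrm]) + struct.pack("<b", imm)
    if num == 0:
        # accumulator short form (opcode 15 id): no ModR/M byte
        return prefix + bytes([0x15]) + struct.pack("<i", imm)
    # adc r/m, imm32 (opcode 81 /2 id): one byte longer than the short form
    return prefix + bytes([0x81, modrm]) + struct.pack("<i", imm)


if __name__ == "__main__":
    for args in [("rax", 12345), ("rcx", 12345), ("rax", 123), ("eax", 0)]:
        print(args, encode_adc_imm(*args).hex())
    print(("eax", 0, "strict dword"), encode_adc_imm("eax", 0, strict_dword=True).hex())
```

    Only the branch that emits opcode 15 produces the form that still decodes to 2 uops on Skylake; the 83 and 81 forms both carry a ModR/M byte.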


    Footnote 2: See "INC instruction vs ADD 1: Does it matter?" for quotes from Intel's optimization manual about Core2 stalling the front-end if a later instruction reads flags from a shl r/m32, imm8, in case the imm8 was 0. (As opposed to the implicit-1 opcode, which the decoder knows always writes flags.)

    But SnB-family doesn't do that; the decoder apparently checks the imm8 to see whether the instruction writes flags unconditionally or whether it leaves them untouched. So checking an imm8 is something that SnB decoders already do, and could usefully do for adc to omit the uop that adds that input, leaving only adding CF to the destination.
