Is there anything special about -1 (0xFFFFFFFF) regarding ADC?

感动是毒 2021-02-05 00:54

In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag mani

1 Answer
  • 2021-02-05 01:37

    mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs. (See Footnote 1.)


    This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.

    I don't know what exactly it saw / was looking for, so yes, you should report this as a missed-optimization bug. Or, if you know anything about GCC's internal representations and want to dig deeper yourself, you could look at the GIMPLE or RTL output after the optimization passes and see what happens. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".


    The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)

    That problem can certainly happen if you're not careful. For example, a general-case adc function that takes carry-in and provides carry-out from the 3-input addition is hard to write in C, because either of the two additions can carry, so you can't just use the usual carry-check idiom (sum = a+b; carry = sum < a) after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.

    e.g. 0xff...ff + 1 wraps around to 0, so the pair sum = a+b+carry_in; carry_out = sum < a can't optimize to an adc, because it needs to ignore carry in the special case where a = -1 and carry_in = 1.
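To make the wraparound problem concrete, here is a minimal sketch in portable C++. The function names are made up for illustration; they're not from any library. With the comparison written against a, the single-compare version goes wrong when b is all-ones and carry_in is 1 (which operand has to be all-ones depends on which one you compare against), while checking each of the two additions separately gets it right.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: folds carry_in into the sum and compares once.
// Wrong when b + carry_in itself wraps (b == 0xFFFFFFFF, carry_in == 1).
inline bool naive_carry_out(std::uint32_t a, std::uint32_t b, bool carry_in,
                            std::uint32_t &sum)
{
    sum = a + b + carry_in;
    return sum < a;
}

// Hypothetical helper: checks each of the two additions separately.
inline bool full_carry_out(std::uint32_t a, std::uint32_t b, bool carry_in,
                           std::uint32_t &sum)
{
    std::uint32_t partial = a + b;
    bool c1 = partial < a;      // carry out of a + b
    sum = partial + carry_in;
    bool c2 = sum < partial;    // carry out of adding carry_in
    return c1 || c2;            // both can't be set at once
}

// Small self-check of the problematic input.
inline void carry_wraparound_demo()
{
    std::uint32_t s = 0;
    // True result is 0x1'FFFFFFFF, so the carry-out should be 1.
    assert(!naive_carry_out(0xFFFFFFFFu, 0xFFFFFFFFu, true, s));  // misses it
    assert(full_carry_out(0xFFFFFFFFu, 0xFFFFFFFFu, true, s));    // catches it
}
```

The correct version is exactly the kind of source that compilers tend not to turn back into a single adc, which is why the intrinsic exists.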

    So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.


    What's the point of using it since it's up to me to provide the carry flag?

    You're using _addcarry_u32 correctly.

    The point of its existence is to let you express an add with carry-in as well as carry-out, which is hard in pure C. GCC and clang don't optimize it well, though: they often fail to simply keep the carry result in CF.

    If you only want carry-out, you can provide a 0 as the carry in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.
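A minimal sketch of that carry-out-only use, assuming an x86 target where immintrin.h provides _addcarry_u32. The wrapper name add_carry_out is hypothetical, just for illustration.

```cpp
#include <cassert>
#include <immintrin.h>  // _addcarry_u32 (x86 only)

// Hypothetical wrapper: a + b with a constant 0 carry-in.
// Returns the carry-out and stores the low 32 bits in sum.
inline unsigned char add_carry_out(unsigned a, unsigned b, unsigned &sum)
{
    return _addcarry_u32(0, a, b, &sum);
}

// Small self-check: one input that wraps, one that doesn't.
inline void add_carry_out_demo()
{
    unsigned sum = 0;
    assert(add_carry_out(0xFFFFFFFFu, 1u, sum) == 1 && sum == 0u);
    assert(add_carry_out(2u, 3u, sum) == 0 && sum == 5u);
}
```

Whether a given compiler version really reduces this to a plain add is something to check in the asm output rather than take for granted.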

    e.g. to add two 128-bit integers in 32-bit chunks, you can do this:

    #include <immintrin.h>  // _addcarry_u32

    // Bad on x86-64 because it doesn't optimize the same as 2x _addcarry_u64,
    // even though __restrict guarantees non-overlap.
    void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
    {
        unsigned char carry;
        carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
        carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
        carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
        (void)_addcarry_u32(carry, dst[3], src[3], &dst[3]);  // final carry-out is discarded
    }
    

    (On Godbolt with GCC/clang/ICC)

    That's very inefficient vs. unsigned __int128 where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add/adc/adc/adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.
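For comparison, a sketch of the unsigned __int128 route. This assumes GCC or clang on a 64-bit little-endian target (so dst[0] is the least-significant 32-bit chunk) and a 32-bit unsigned; the function name add_128bit_int128 is made up for this example.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical counterpart to adc_128bit using the unsigned __int128 extension.
// Assumes 4 x 32-bit unsigned per operand, least-significant chunk first.
inline void add_128bit_int128(unsigned *__restrict dst, const unsigned *__restrict src)
{
    static_assert(sizeof(unsigned) * 4 == sizeof(unsigned __int128),
                  "expects 32-bit unsigned");
    unsigned __int128 a, b;
    std::memcpy(&a, dst, sizeof a);   // memcpy sidesteps alignment and strict-aliasing issues
    std::memcpy(&b, src, sizeof b);
    a += b;                           // wraps modulo 2^128, final carry-out discarded
    std::memcpy(dst, &a, sizeof a);
}

// Small self-check: a carry has to ripple from chunk 0 through chunk 1 into chunk 2.
inline void add_128bit_int128_demo()
{
    unsigned d[4] = {0xFFFFFFFFu, 0xFFFFFFFFu, 0u, 0u};
    const unsigned s[4] = {1u, 0u, 0u, 0u};
    add_128bit_int128(d, s);
    assert(d[0] == 0u && d[1] == 0u && d[2] == 1u && d[3] == 0u);
}
```

The memcpy calls are expected to compile away to plain loads and stores at normal optimization levels, but that's worth confirming on Godbolt for the compiler you care about.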

    GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.


    Footnote 1: For uop count, the two sequences are equal on Intel Haswell and earlier, where adc is 2 uops (except with a zero immediate, which Sandybridge-family's decoders special-case as 1 uop).

    But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.

    On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.

    So equal total uop count but worse latency means that adc would still be a better choice.

    https://agner.org/optimize/
