How can I safely average two unsigned ints in C++?

后端 未结 10 885
情书的邮戳
情书的邮戳 2020-12-07 19:05

Using integer math alone, I\'d like to \"safely\" average two unsigned ints in C++.

What I mean by \"safely\" is avoiding overflows (and anything else that can be th

10条回答
  •  囚心锁ツ
    2020-12-07 19:18

    If you don't mind a little x86 inline assembly (GNU C syntax), you can take advantage of supercat's suggestion to use rotate-with-carry after an add to put the high 32 bits of the full 33-bit result into a register.

    Of course, you usually should mind using inline-asm, because it defeats some optimizations (https://gcc.gnu.org/wiki/DontUseInlineAsm). But here we go anyway:

    // works for 64-bit long as well on x86-64, and doesn't depend on calling convention
    unsigned average(unsigned x, unsigned y)
    {
        unsigned result;
        asm("add   %[x], %[res]\n\t"
            "rcr   %[res]"
            : [res] "=r" (result)   // output
            : [y] "%0"(y),  // input: in the same reg as results output.  Commutative with next operand
              [x] "rme"(x)  // input: reg, mem, or immediate
            :               // no clobbers.  ("cc" is implicit on x86)
        );
        return result;
    }
    

    The % modifier to tell the compiler the args are commutative doesn't actually help make better asm in the case I tried, calling the function with y being a constant or pointer-deref (memory operand). Probably using a matching constraint for an output operand defeats that, since you can't use it with read-write operands.

    As you can see on the Godbolt compiler explorer, this compiles correctly, and so does a version where we change the operands to unsigned long, with the same inline asm. clang3.9 makes a mess of it, though, and decides to use the "m" option for the "rme" constraint, so it stores to memory and uses a memory operand.


    RCR-by-one is not too slow, but it's still 3 uops on Skylake, with 2 cycle latency. It's great on AMD CPUs, where RCR has single-cycle latency. (Source: Agner Fog's instruction tables, see also the x86 tag wiki for x86 performance links). It's still better than @sellibitze's version, but worse than @Sheldon's order-dependent version. (See code on Godbolt)

    But remember that inline-asm defeats optimizations like constant-propagation, so any pure-C++ version will be better in that case.

提交回复
热议问题