How can I safely average two unsigned ints in C++?

后端未结

关注

 10  885

情书的邮戳 2020-12-07 19:05

Using integer math alone, I\'d like to \"safely\" average two unsigned ints in C++.

What I mean by \"safely\" is avoiding overflows (and anything else that can be th

10条回答

囚心锁ツ (楼主)

2020-12-07 19:18
If you don't mind a little x86 inline assembly (GNU C syntax), you can take advantage of supercat's suggestion to use rotate-with-carry after an add to put the high 32 bits of the full 33-bit result into a register.

Of course, you usually should mind using inline-asm, because it defeats some optimizations (https://gcc.gnu.org/wiki/DontUseInlineAsm). But here we go anyway:
```
// works for 64-bit long as well on x86-64, and doesn't depend on calling convention
unsigned average(unsigned x, unsigned y)
{
    unsigned result;
    asm("add   %[x], %[res]\n\t"
        "rcr   %[res]"
        : [res] "=r" (result)   // output
        : [y] "%0"(y),  // input: in the same reg as results output.  Commutative with next operand
          [x] "rme"(x)  // input: reg, mem, or immediate
        :               // no clobbers.  ("cc" is implicit on x86)
    );
    return result;
}
```
The % modifier to tell the compiler the args are commutative doesn't actually help make better asm in the case I tried, calling the function with y being a constant or pointer-deref (memory operand). Probably using a matching constraint for an output operand defeats that, since you can't use it with read-write operands.

As you can see on the Godbolt compiler explorer, this compiles correctly, and so does a version where we change the operands to unsigned long, with the same inline asm. clang3.9 makes a mess of it, though, and decides to use the "m" option for the "rme" constraint, so it stores to memory and uses a memory operand.

RCR-by-one is not too slow, but it's still 3 uops on Skylake, with 2 cycle latency. It's great on AMD CPUs, where RCR has single-cycle latency. (Source: Agner Fog's instruction tables, see also the x86 tag wiki for x86 performance links). It's still better than @sellibitze's version, but worse than @Sheldon's order-dependent version. (See code on Godbolt)

But remember that inline-asm defeats optimizations like constant-propagation, so any pure-C++ version will be better in that case.
0 讨论(0)

查看其它10个回答
发布评论:

提交评论
- 加载中...