How to optimize function return values in C and C++ on x86-64?

问题

The x86-64 ABI specifies two return registers: rax and rdx, both 64-bits (8 bytes) in size.

Assuming that x86-64 is the only targeted platform, which of these two functions:

uint64_t f(uint64_t * const secondReturnValue) {
    /* Calculate a and b. */
    *secondReturnValue = b;
    return a;
}

std::pair<uint64_t, uint64_t> g() {
    /* Calculate a and b, same as in f() above. */
    return { a, b };
}

would yield better performance, given the current state of C/C++ compilers targeting x86-64? Are there any pitfalls performance-wise using one or the other version? Are compilers (GCC, Clang) always able to optimize the std::pair to be returned in rax and rdx?

UPDATE: Generally, returning a pair is faster if the compiler optimizes out the std::pair methods (examples of binary output with GCC 5.3.0 and Clang 3.8.0). If f() is not inlined, the compiler must generate code to write a value to memory, e.g:

movq b, (%rdi)
movq a, %rax
retq

But in case of g() it suffices for the compiler to do:

movq a, %rax
movq b, %rdx
retq

Because instructions for writing values to memory are generally slower than instructions for writing values to registers, the second version should be faster.

回答1:

Since the ABI specifies that in some particular cases two registers have to be used for the 2-word result any conforming compiler has to obey that rule.

However, for such tiny functions I guess that most of the performance will come from inlining.

You may want to compile and link with g++ -flto -O2 using link-time optimizations.

I guess that the second function (returning a pair thru 2 registers) might be slightly faster, and that perhaps in some situations the GCC compiler could inline and optimize the first into the second.

But you really should benchmark if you care that much.

回答2:

Note that the ABI specifies packing any small struct into registers for passing/returning (if it contains only integer types). This means that returning a std::pair<uint32_t, uint32_t> means the values have to be shift+ORed into rax.

This is probably still better than a round trip through memory, because setting up space for a pointer, and passing that pointer as an extra arg, has some overhead. (Other than that, though, a round-trip through L1 cache is pretty cheap, like ~5c latency. The store/load are almost certainly going to hit in L1 cache, because stack memory is used all the time. Even if it misses, store-forwarding can still happen, so execution doesn't stall until the ROB fills because the store can't retire. See Agner Fog's microarch guide and other stuff at the x86 tag wiki.)

Anyway, here's the kind of code you get from gcc 5.3 -O2, using functions that take args instead of returning compile-time constant values (which would lead to movabs rax, 0x...):

#include <cstdint>
#include <utility>
#define type_t uint32_t

type_t f(type_t * const secondReturnValue, type_t x) {
    *secondReturnValue = x+4;
    return x+2;
}
    lea     eax, [rsi+4]           # LEA is an add-and-shift instruction that uses memory-operand syntax and encoding
    mov     DWORD PTR [rdi], eax
    lea     eax, [rsi+2]
    ret

std::pair<type_t, type_t> g(type_t x) { return {x+2, x+4}; }
    lea     eax, [rdi+4]
    lea     edx, [rdi+2]
    sal     rax, 32
    or      rax, rdx
    ret

type_t use_pair(std::pair<type_t, type_t> pair) {
    return pair.second + pair.first;
}
    mov     rax, rdi
    shr     rax, 32
    add     eax, edi
    ret

So it's really not bad at all. Two or three insns in the caller and callee to pack and unpack a pair of uint32_t values. Nowhere near as good as returning a pair of uint64_t values, though.

If you're specifically optimizing for x86-64, and care what happens for non-inlined functions with multiple return values, then prefer returning std::pair<uint64_t, uint64_t> (or int64_t, obviously), even if you assign those pairs to narrower integers in the caller. Note that in the x32 ABI (-mx32), pointers are only 32bits. Don't assume pointers are 64bit when optimizing for x86-64, if you care about that ABI.

If either member of the pair is 64bit, they use separate registers. It doesn't do anything stupid like splitting one value between the high half of one reg and the low half of another.

来源：https://stackoverflow.com/questions/25381736/how-to-optimize-function-return-values-in-c-and-c-on-x86-64

标签

c++

performance

x86-64

abi