Passing rvalue to non-ref parameter, why can't the compiler elide the copy?

struct Big {
    int a[8];
};
void foo(Big a);
Big getStuff();
void test1() {
    foo(getStuff());
}

This compiles (clang 6.0.0 targeting x86-64 Linux, so the System V ABI; flags: -O3 -march=broadwell) to:

test1():                              # @test1()
        sub     rsp, 72
        lea     rdi, [rsp + 40]
        call    getStuff()
        vmovups ymm0, ymmword ptr [rsp + 40]
        vmovups ymmword ptr [rsp], ymm0
        vzeroupper
        call    foo(Big)
        add     rsp, 72
        ret

If I am reading this correctly, this is what is happening:

getStuff is passed a pointer into test1's stack frame (rsp + 40) to use for its return value, so after getStuff returns, rsp + 40 through rsp + 71 contains the result of getStuff.
This result is then immediately copied to a lower stack address rsp through to rsp + 31.
foo is then called, which will read its argument from rsp.

Why is the following code not totally equivalent (and why doesn't the compiler generate it instead)?

test1():                              # @test1()
        sub     rsp, 32
        mov     rdi, rsp
        call    getStuff()
        call    foo(Big)
        add     rsp, 32
        ret

The idea is: have getStuff write directly to the place in the stack that foo will read from.

Also: here is the result for the same code (with 12 ints instead of 8) compiled by MSVC for x64 Windows. It seems even worse, because the Windows x64 ABI passes and returns large structs by reference, so the copy is completely unnecessary!

_TEXT   SEGMENT
$T3 = 32
$T1 = 32
?bar@@YAHXZ PROC                    ; bar, COMDAT

$LN4:
    sub rsp, 88                 ; 00000058H

    lea rcx, QWORD PTR $T1[rsp]
    call    ?getStuff@@YA?AUBig@@XZ         ; getStuff
    lea rcx, QWORD PTR $T3[rsp]
    movups  xmm0, XMMWORD PTR [rax]
    movaps  XMMWORD PTR $T3[rsp], xmm0
    movups  xmm1, XMMWORD PTR [rax+16]
    movaps  XMMWORD PTR $T3[rsp+16], xmm1
    movups  xmm0, XMMWORD PTR [rax+32]
    movaps  XMMWORD PTR $T3[rsp+32], xmm0
    call    ?foo@@YAHUBig@@@Z           ; foo

    add rsp, 88                 ; 00000058H
    ret 0

You're right; this looks like a missed optimization by the compiler. You can report it as a bug (https://bugs.llvm.org/) if there isn't already a duplicate.

Contrary to popular belief, compilers often don't make optimal code. It's often good enough, and modern CPUs are quite good at plowing through excess instructions when they don't lengthen dependency chains too much, especially the critical path dependency chain if there is one.

x86-64 SysV passes large structs by value on the stack if they don't fit packed into two 64-bit integer registers, and returns them via hidden pointer. The compiler can and should (but doesn't) plan ahead and reuse the return value temporary as the stack-args for the call to foo(Big).
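As a rough illustration of why Big ends up in memory rather than in registers, here is a minimal sketch. The helper passed_in_memory_sysv is hypothetical (it is not part of any ABI header or compiler API) and encodes only a simplified reading of the rule above: a trivially copyable struct of plain ints larger than 16 bytes is passed on the stack and returned through a hidden pointer. The real classification algorithm has more cases (for example vector types), which this sketch ignores.

```cpp
#include <cstddef>
#include <type_traits>

struct Big {
    int a[8];
};

// Hypothetical helper, simplified assumption: an aggregate of integers that
// does not fit in two 64-bit registers (16 bytes) is classified as MEMORY,
// i.e. passed on the stack and returned via hidden pointer.
constexpr bool passed_in_memory_sysv(std::size_t size_in_bytes) {
    return size_in_bytes > 16;
}

// Big is trivially copyable, so copying it has no observable side effects.
static_assert(std::is_trivially_copyable<Big>::value,
              "Big should be trivially copyable");
// No padding is expected in an array of 8 ints.
static_assert(sizeof(Big) == 8 * sizeof(int), "unexpected padding in Big");
// With 4-byte int this is 32 bytes, well over the 16-byte register limit.
static_assert(passed_in_memory_sysv(sizeof(Big)),
              "Big should be passed in memory under this simplified rule");
```

Because both the return slot and the argument slot live in the caller's stack frame, nothing in the ABI itself appears to prevent using the same 32 bytes for both.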

gcc7.3, ICC18, and MSVC CL19 also miss this optimization. :/ I put your code up on the Godbolt compiler explorer with gcc/clang/ICC/MSVC. gcc uses 4x push qword [rsp+24] to copy, while ICC uses extra instructions to align the stack by 32.

Using 1x 32-byte load/store instead of 2x 16-byte probably doesn't justify the cost of the vzeroupper for MSVC / ICC / clang, for a function this small. vzeroupper is cheap on mainstream Intel CPUs (only 4 uops), and I did use -march=haswell to tune for that, not for AMD or KNL where it's more expensive.

Related: x86-64 Windows passes large structs by hidden pointer, as well as returning them that way. The callee owns the pointed-to memory. (See "What happens at assembly level when you have functions with large inputs".)

This optimization would still be available by simply reserving space for the temporary + shadow-space before the first call to getStuff(), and allowing the callee to destroy the temporary because we don't need it later.

That's not actually what MSVC does here or in related cases, though, unfortunately.
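To make the hand-off idea concrete, here is a sketch that models the hidden pointers explicitly in C++. The names getStuff_lowered, foo_lowered, test1_with_copy and test1_handoff are hypothetical, and foo is given an int return value and a made-up body purely so the two variants can be compared; this is a model of the calling convention, not what any compiler literally emits.

```cpp
#include <cassert>

struct Big {
    int a[8];
};

// Hypothetical lowered form of `Big getStuff()`: the caller supplies the
// storage for the return value through a hidden pointer.
void getStuff_lowered(Big* ret) {
    for (int i = 0; i < 8; ++i) ret->a[i] = i + 1;
}

// Hypothetical lowered form of `foo(Big)`: the callee receives a pointer to
// an object it owns, so it may freely modify (destroy) it.
int foo_lowered(Big* arg) {
    int sum = 0;
    for (int i = 0; i < 8; ++i) sum += arg->a[i];
    arg->a[0] = -1;  // callee scribbles on its own argument
    return sum;
}

// What the compilers in this answer generate: a temporary plus a copy.
int test1_with_copy() {
    Big tmp;
    getStuff_lowered(&tmp);
    Big arg = tmp;  // the redundant copy
    return foo_lowered(&arg);
}

// The missed optimization: hand the temporary straight to the callee.
int test1_handoff() {
    Big tmp;
    getStuff_lowered(&tmp);
    return foo_lowered(&tmp);  // tmp is dead after this call
}

// Both variants observe the same argument contents.
void demo_handoff() {
    assert(test1_with_copy() == test1_handoff());
}
```

The two variants are equivalent here only because nothing reads tmp after the call and no other pointer to tmp exists.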

See also @BeeOnRope's answer, and my comments on it, on "Why isn't pass struct by reference a common optimization?". Making sure the copy-constructor can always run at a sane place for non-trivially-copyable objects is problematic if you're trying to design a calling convention that avoids copying by passing by hidden const-reference (caller owns the memory, callee can copy if needed).

But this is an example of a case where non-const reference (callee owns the memory) is best, because the caller wants to hand off the object to the callee.

There's a potential gotcha, though: if there are any pointers to this object, letting the callee use it directly could introduce bugs. Consider some other function that does global_pointer->a[4]=0;. If our callee calls that function, it will unexpectedly modify our callee's by-value arg.

So letting the callee destroy our copy of the object in the Windows x64 calling convention only works if escape analysis can prove that nothing else has a pointer to this object.
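A small sketch of that gotcha, using a named object instead of a temporary so the aliasing is easy to see. The names clobber_through_global and demo_alias_gotcha are made up for illustration, and foo returns int here only so the result can be checked. By-value semantics require the callee's argument to be unaffected by the write through global_pointer, which is exactly what would break if the caller handed over its own object without copying.

```cpp
#include <cassert>

struct Big {
    int a[8];
};

Big* global_pointer = nullptr;

// Some other function that writes through a pointer to the caller's object.
void clobber_through_global() {
    global_pointer->a[4] = 0;
}

// Takes its argument by value, so it must see a private copy.
int foo(Big arg) {
    clobber_through_global();
    return arg.a[4];  // must still be the value the caller passed
}

int demo_alias_gotcha() {
    Big original{};
    original.a[4] = 42;
    global_pointer = &original;  // a pointer to the object escapes

    int seen_by_callee = foo(original);

    // The caller's object was modified through the global pointer...
    assert(original.a[4] == 0);
    global_pointer = nullptr;  // avoid leaving a dangling pointer behind
    // ...but the callee's by-value copy was not.
    return seen_by_callee;
}
```

If the compiler passed a pointer to original itself instead of a copy, foo would return 0 rather than 42, which is why the hand-off is only legal when no such pointer can exist.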

Source: https://stackoverflow.com/questions/49474685/passing-rvalue-to-non-ref-parameter-why-cant-the-compiler-elide-the-copy

Tags: c++, clang, x86-64, compiler-optimization, abi