问题
I am investigating some problem with a local binary. I've noticed that g++ creates a lot of ASM output that seems unnecessary to me. Example with -O0
:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp <--- just need 8 bytes for the movq to -8(%rbp), why -16?
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi <--- now we have moved rdi onto itself.
call Base::Base()
leaq 16+vtable for Derived(%rip), %rdx
movq -8(%rbp), %rax <--- effectively %edi, does not point into this area of the stack
movq %rdx, (%rax) <--- thus this wont change -8(%rbp)
movq -8(%rbp), %rax <--- so this statement is unnecessary
movl $4712, 12(%rax)
nop
leave
ret
option -O1 -fno-inline -fno-elide-constructors -fno-omit-frame-pointer
:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
pushq %rbx
subq $8, %rsp <--- reserve some stack space and never use it.
movq %rdi, %rbx
call Base::Base()
leaq 16+vtable for Derived(%rip), %rax
movq %rax, (%rbx)
movl $4712, 12(%rbx)
addq $8, %rsp <--- release unused stack space.
popq %rbx
popq %rbp
ret
This code is for the constructor of Derived
that calls the Base
base constructor and then overrides the vtable pointer at position 0 and sets a constant value to an int member it holds in addition to what Base
contains.
Question:
- Can I translate my program with as few optimizations as possible and get rid of such stuff? Which options would I have to set? Or is there a reason the compiler cannot detect these cases with
-O0
or-O1
and there is no way around them? - Why is the
subq $8, %rsp
statement generated at all? You cannot optimize in or out a statement that makes no sense to begin with. Why does the compiler generate it then? The register allocation algorithm should never, even with O0, generate code for something that is not there. So why it is done?
回答1:
I don't see any obvious missed optimizations in your -O1
output. Except of course setting up RBP as a frame pointer, but you used -fno-omit-frame-pointer
so clearly you know why GCC didn't optimize that away.
The function has no local variables
Your function is a non-static class member function, so it has one implicit arg: this
in rdi
. Which g++ spills to the stack because of -O0
. Function args count as local variables.
How does a cyclic move without an effect improve the debugging experience. Please elaborate.
To improve C/C++ debugging: debug-info formats can only describe a C variable's location relative to RSP or RBP, not which register it's currently in. Also, so you can modify any variable with a debugger and continue, getting the expected results as if you'd done that in the C++ abstract machine. Every statement is compiled to a separate block of asm with no values alive in registers (Fun fact: except register int foo
: that keyword does affect debug-mode code gen).
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to G++ and other compilers as well.
Which options would I have to set?
If you're reading / debugging the asm, use at least -Og
or higher to disable the debug-mode spill-everything-between-statements behaviour of -O0
. Preferably -O2
or -O3
unless you like seeing even more missed optimizations than you'd get with full optimization. But -Og
or -O1
will do register allocation and make sane loops (with the conditional branch at the bottom), and various simple optimizations. Although still not the standard peephole of xor-zeroing.
How to remove "noise" from GCC/clang assembly output? explains how to write functions that take args and return a value so you can write functions that don't optimize away.
Loading into RAX and then movq %rax, %rdi
is just a side-effect of -O0
. GCC spends so little time optimizing the GIMPLE and/or RTL internal representations of the program logic (before emitting x86 asm) that it doesn't even notice it could have loaded into RDI in the first place. Part of the point of -O0
is to compile quickly, as well as consistent debugging.
Why is the
subq $8, %rsp
statement generated at all?
Because the ABI requires 16-byte stack alignment before a call
instruction, and this function did an even number of 8-byte push
es. (call
itself pushes a return address). It will go away at -O1
without -fno-omit-frame-pointer
because you aren't forcing g++ to push/pop RBP as well as the call-preserved register it actually needs.
Why does System V / AMD64 ABI mandate a 16 byte stack alignment?
Fun fact: clang will often use a dummy push %rcx
/pop
or something, depending on -mtune
options, instead of an 8-byte sub.
If it were a leaf function, g++ would just use the red-zone below RSP for locals, even at -O0
. Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?
In un-optimized code it's not rare for G++ to allocate an extra 16 bytes it doesn't ever use. Even sometimes with optimization enabled g++ rounds up its stack allocation size too far when aiming for a 16-byte boundary. This is a missed-optimization bug. e.g. Memory allocation and addressing in Assembly
回答2:
is there a reason the compiler cannot detect these cases with
-O0
or-O1
exactly because you're telling the compiler not to. These are optimisation levels that need to be turn off or down for proper debugging. You're also trading off compilation time for run-time.
You're looking through the telescope the wrong way, check out the awesome optimisations that you're compiler will do for you when you crank up optimisation.
来源:https://stackoverflow.com/questions/58131883/remove-needless-assembler-statements-from-g-output