What's up with gcc's weird stack manipulation when it wants extra stack alignment?

不知归路 2021-01-19 16:23

I've seen this r10 weirdness a few times, so let's see if anyone knows what's up.

Take this simple function:

#define SZ 4

void sink         


        
1 Answer
  • 2021-01-19 16:42

    Well, you answered your question: the stack must be aligned to 32 bytes before stack memory can be accessed with aligned AVX2 loads and stores, but the ABI only guarantees 16-byte alignment. Since the compiler cannot know by how much the alignment is off, it has to save the original stack pointer in a scratch register and restore it afterwards. But the saved value has to outlive the function call, so it has to be put on the stack, and a stack frame has to be created.
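To make the "how much the alignment is off" point concrete, here is a minimal sketch of the arithmetic behind an instruction like `andq $-32, %rsp`. The helper names `align_down` and `realign_padding` are hypothetical and exist only for this illustration. A stack pointer that is 16-byte aligned is either already 32-byte aligned or off by exactly 16, so the AND step skips 0 or 16 bytes, and which of the two is only known at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a multiple of align (align must be a power of two).
   For align == 32 this is the same mask that `andq $-32, %rsp` applies. */
static uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/* Number of bytes the AND step skips for a given incoming stack pointer. */
static uintptr_t realign_padding(uintptr_t sp, uintptr_t align)
{
    return sp - align_down(sp, align);
}
```

For example, `realign_padding(0x7ffc1220, 32)` is 0 and `realign_padding(0x7ffc1230, 32)` is 16. Because that padding is not a compile-time constant, the function cannot undo it with a fixed `add` and has to keep the original stack pointer somewhere.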

    Some x86-64 ABIs have a red zone (a region of the stack below the stack pointer which is not used by signal handlers), so it is feasible not to change the stack pointer at all for such short functions, but GCC apparently does not implement this optimization and it would not apply here anyway because of the function call at the end.

    In addition, the default stack alignment implementation is rather poor. For this case, -maccumulate-outgoing-args results in better-looking code with GCC 6, just aligning RSP after saving RBP, instead of copying the return address before saving RBP:

    andpop:
            pushq   %rbp
            movq    %rsp, %rbp            # make a traditional stack frame
            andq    $-32, %rsp            # reserve 0 or 16 bytes
            subq    $32, %rsp
    
            vmovdqu (%rdi), %xmm0         # split unaligned load from tune=generic
            vinserti128     $0x1, 16(%rdi), %ymm0, %ymm0   # use -march=haswell instead
            movq    %rsp, %rdi
            vpaddq  .LC0(%rip), %ymm0, %ymm0
            vmovdqa %ymm0, (%rsp)
    
            vzeroupper
            call    sink@PLT
            leave
            ret
    

    (editor's note: gcc8 and later make asm like this by default (Godbolt compiler explorer with gcc8, clang7, ICC19, and MSVC), even without -maccumulate-outgoing-args)
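For readers who want to experiment, the following is a rough C sketch of a function with the same shape as the assembly above: it loads four 64-bit values, adds a constant, stores the result to a 32-byte-aligned local, and passes that local to `sink`. It is not the code from the question. The name `add_and_sink`, the recording version of `sink`, the explicit `_Alignas(32)`, and the constant 1 (the contents of `.LC0` are not shown) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SZ 4

/* Hypothetical stand-in for the external sink(): records what it received. */
uint64_t  seen[SZ];
uintptr_t seen_addr;

void sink(uint64_t *p)
{
    assert(p != NULL);
    seen_addr = (uintptr_t)p;
    memcpy(seen, p, sizeof seen);
}

/* Hypothetical function shaped like the assembly: the over-aligned local
   forces the compiler to realign the stack before the store and the call. */
void add_and_sink(const uint64_t *a)
{
    _Alignas(32) uint64_t result[SZ];
    for (unsigned i = 0; i < SZ; i++)
        result[i] = a[i] + 1;   /* the constant 1 is an assumption */
    sink(result);
}
```

Compiling something like this with optimization and AVX2 enabled, with and without -maccumulate-outgoing-args, should make it possible to compare the prologues, though the exact output depends on the compiler version and tuning options.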


    This issue (GCC generating poor code for stack alignment) recently came up when we had to implement a workaround for the GCC __tls_get_addr ABI bug, and we ended up writing the stack realignment by hand.

    EDIT: There is also another issue, related to RTL pass ordering: the stack alignment is picked before the final determination of whether the stack is actually needed, as BeeOnRope's second example shows.
