Why does the x86-64 GCC function prologue allocate less stack than the local variables?

北恋 2020-11-30 10:35

Consider the following simple program:

int main(int argc, char **argv)
{
        char buffer[256];

        buffer[0] = 0x41;
        buffer[128] = 0x41;

        return 0;
}

1 answer
  • 2020-11-30 10:41

    The x86-64 ABI used by Linux (the System V AMD64 ABI, which is also used by some other OSes, although notably not Windows, which has its own different ABI) defines a "red zone" of 128 bytes below the stack pointer, which is guaranteed not to be touched by signal or interrupt handlers. (See figure 3.3 and §3.2.2 of the ABI specification.)

    A leaf function (i.e. one which does not call anything else) may therefore use this area for whatever it wants - it isn't doing anything like a call which would place data at the stack pointer; and any signal or interrupt handler will follow the ABI and drop the stack pointer by at least an additional 128 bytes before storing anything.

    (Shorter instruction encodings are available for signed 8-bit displacements, so the point of the red zone is that it increases the amount of local data that a leaf function can access using these shorter instructions.)

    That's what's happening here.
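As a rough illustration of the point about 8-bit displacements, the sketch below counts how many stack bytes can be addressed relative to rsp with a one-byte displacement, with and without permission to use the area below rsp. The function names are made up for this example and do not come from any compiler or ABI header.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `offset` can be encoded as a signed 8-bit displacement (-128..127). */
bool fits_disp8(long offset)
{
    return offset >= INT8_MIN && offset <= INT8_MAX;
}

/*
 * Count the byte offsets in [lowest, highest] (relative to rsp) that are
 * reachable with a signed 8-bit displacement.
 * lowest = 0    models a function that may only use memory at or above rsp.
 * lowest = -128 models a leaf function that may also use the red zone.
 */
int disp8_reachable_bytes(long lowest, long highest)
{
    int count = 0;

    for (long off = lowest; off <= highest; off++)
        if (fits_disp8(off))
            count++;

    return count;
}
```

Under these assumptions, `disp8_reachable_bytes(0, 4095)` gives 128 and `disp8_reachable_bytes(-128, 4095)` gives 256, so the red zone doubles the amount of local data within reach of the short encoding.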

    But... this code isn't making use of those shorter encodings (it's using offsets from rbp rather than rsp). Why not? It's also saving edi and rsi completely unnecessarily - you ask why it's saving edi instead of rdi, but why is it saving it at all?

    The answer is that the compiler is generating really crummy code, because no optimisations are enabled. If you enable any optimisation, your entire function is likely to collapse down to:

    mov eax, 0
    ret
    

    because that's really all it needs to do: buffer[] is local, so the changes made to it will never be visible to anything else, and the stores can be optimised away; beyond that, all the function needs to do is return 0.
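As an aside, one common way to stop the stores from disappearing is to make the array volatile, since the compiler must then perform every access. This is only a sketch and not the approach taken in the rest of this answer; `touch_buffer` is a hypothetical name, and whether GCC then places the array in the red zone depends on the compiler version and flags.

```c
/* Hypothetical variant of the question's function, for illustration only. */
int touch_buffer(void)
{
    /* volatile: each store and load below must actually be emitted. */
    volatile char buffer[256];

    buffer[0] = 0x41;
    buffer[128] = 0x41;

    /* Read the values back so the result depends on the stores. */
    return buffer[0] + buffer[128];
}
```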


    So, here's a better example. This function is complete nonsense, but makes use of a similar array, whilst doing enough to ensure that things don't all get optimised away:

    $ cat test.c
    int foo(char *bar)
    {
        char tmp[256];
        int i;
    
        for (i = 0; bar[i] != 0; i++)
          tmp[i] = bar[i] + i;
    
        return tmp[1] + tmp[200];
    }
    

    Compiled with some optimisation, you can see similar use of the red zone, except this time it really does use offsets from rsp:

    $ gcc -m64 -O1 -c test.c
    $ objdump -Mintel -d test.o
    
    test.o:     file format elf64-x86-64
    
    
    Disassembly of section .text:
    
    0000000000000000 <foo>:
       0:   53                      push   rbx
       1:   48 81 ec 88 00 00 00    sub    rsp,0x88
       8:   0f b6 17                movzx  edx,BYTE PTR [rdi]
       b:   84 d2                   test   dl,dl
       d:   74 26                   je     35 <foo+0x35>
       f:   4c 8d 44 24 88          lea    r8,[rsp-0x78]
      14:   48 8d 4f 01             lea    rcx,[rdi+0x1]
      18:   4c 89 c0                mov    rax,r8
      1b:   89 c3                   mov    ebx,eax
      1d:   44 28 c3                sub    bl,r8b
      20:   89 de                   mov    esi,ebx
      22:   01 f2                   add    edx,esi
      24:   88 10                   mov    BYTE PTR [rax],dl
      26:   0f b6 11                movzx  edx,BYTE PTR [rcx]
      29:   48 83 c0 01             add    rax,0x1
      2d:   48 83 c1 01             add    rcx,0x1
      31:   84 d2                   test   dl,dl
      33:   75 e6                   jne    1b <foo+0x1b>
      35:   0f be 54 24 50          movsx  edx,BYTE PTR [rsp+0x50]
      3a:   0f be 44 24 89          movsx  eax,BYTE PTR [rsp-0x77]
      3f:   8d 04 02                lea    eax,[rdx+rax*1]
      42:   48 81 c4 88 00 00 00    add    rsp,0x88
      49:   5b                      pop    rbx
      4a:   c3                      ret    
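
The numbers in this listing can be checked by hand. The sketch below takes the two constants visible in the disassembly (`sub rsp,0x88` and the array base `rsp-0x78`) and confirms that the 256-byte array ends exactly at the top of the allocated area while its lower part sits in the red zone. The enum and function names are invented for this example.

```c
#include <stdbool.h>

enum {
    TMP_SIZE      = 256,   /* sizeof tmp in foo() */
    RED_ZONE_SIZE = 128,   /* bytes below rsp reserved by the ABI */
    SUB_RSP       = 0x88,  /* from `sub rsp,0x88` in the listing */
    TMP_BASE      = -0x78  /* from `lea r8,[rsp-0x78]` in the listing */
};

/* Offset, relative to rsp, one past the last byte of tmp. */
long tmp_end_offset(void)
{
    return TMP_BASE + TMP_SIZE;
}

/* Number of bytes of tmp that live below rsp, i.e. in the red zone. */
long tmp_bytes_below_rsp(void)
{
    return -TMP_BASE;
}

/* True if tmp fits in the red zone plus the explicitly allocated area. */
bool tmp_fits_frame(void)
{
    return tmp_bytes_below_rsp() <= RED_ZONE_SIZE
        && tmp_end_offset() <= SUB_RSP;
}
```

With these values, 120 bytes of tmp lie below rsp and the remaining 136 (0x88) bytes lie above it, which is why the prologue subtracts less than 256.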
    

    Now let's tweak it very slightly, by inserting a call to another function, so that foo() is no longer a leaf function:

    $ cat test.c
    extern void dummy(void);  /* ADDED */
    
    int foo(char *bar)
    {
        char tmp[256];
        int i;
    
        for (i = 0; bar[i] != 0; i++)
          tmp[i] = bar[i] + i;
    
        dummy();  /* ADDED */
    
        return tmp[1] + tmp[200];
    }
    

    Now the red zone cannot be used, so you see something more like you originally expected:

    $ gcc -m64 -O1 -c test.c
    $ objdump -Mintel -d test.o
    
    test.o:     file format elf64-x86-64
    
    
    Disassembly of section .text:
    
    0000000000000000 <foo>:
       0:   53                      push   rbx
       1:   48 81 ec 00 01 00 00    sub    rsp,0x100
       8:   0f b6 17                movzx  edx,BYTE PTR [rdi]
       b:   84 d2                   test   dl,dl
       d:   74 24                   je     33 <foo+0x33>
       f:   49 89 e0                mov    r8,rsp
      12:   48 8d 4f 01             lea    rcx,[rdi+0x1]
      16:   48 89 e0                mov    rax,rsp
      19:   89 c3                   mov    ebx,eax
      1b:   44 28 c3                sub    bl,r8b
      1e:   89 de                   mov    esi,ebx
      20:   01 f2                   add    edx,esi
      22:   88 10                   mov    BYTE PTR [rax],dl
      24:   0f b6 11                movzx  edx,BYTE PTR [rcx]
      27:   48 83 c0 01             add    rax,0x1
      2b:   48 83 c1 01             add    rcx,0x1
      2f:   84 d2                   test   dl,dl
      31:   75 e6                   jne    19 <foo+0x19>
      33:   e8 00 00 00 00          call   38 <foo+0x38>
      38:   0f be 94 24 c8 00 00    movsx  edx,BYTE PTR [rsp+0xc8]
      3f:   00 
      40:   0f be 44 24 01          movsx  eax,BYTE PTR [rsp+0x1]
      45:   8d 04 02                lea    eax,[rdx+rax*1]
      48:   48 81 c4 00 01 00 00    add    rsp,0x100
      4f:   5b                      pop    rbx
      50:   c3                      ret    
    

    (Note that tmp[200] was in range of a signed 8-bit displacement in the first case, where it sits at rsp+0x50, but is not in this one, where it sits at rsp+0xc8.)
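
The displacements in the two listings can be reproduced with a little arithmetic. In the sketch below the two base offsets are read off the disassembly (tmp starts at `rsp-0x78` in the leaf version and at `rsp` in the version that calls dummy()); the names are hypothetical and chosen only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    TMP_BASE_LEAF    = -0x78, /* leaf foo(): tmp starts inside the red zone */
    TMP_BASE_NONLEAF = 0      /* foo() calling dummy(): tmp starts at rsp */
};

/* Displacement from rsp of tmp[index], given where tmp starts. */
long tmp_disp(long tmp_base, int index)
{
    return tmp_base + index;
}

/* True if the displacement does not fit in a signed 8-bit field. */
bool needs_disp32(long disp)
{
    return disp < INT8_MIN || disp > INT8_MAX;
}
```

For index 200 this gives 0x50 in the leaf case and 0xc8 in the non-leaf case, and only the latter needs a 32-bit displacement. That matches the listings, where the tmp[200] load is 5 bytes long in the first and 8 bytes long in the second.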
