What is tail call optimization?

一整个雨季 2020-11-21 06:36

Very simply, what is tail-call optimization?

More specifically, what are some small code snippets where it could be applied, and where not, with an explanation of wh

10 answers
  •  既然无缘
    2020-11-21 07:22

    GCC C minimal runnable example with x86 disassembly analysis

    Let's see how GCC can automatically do tail call optimizations for us by looking at the generated assembly.

    This will serve as an extremely concrete example of what was mentioned in other answers such as https://stackoverflow.com/a/9814654/895245 that the optimization can convert recursive function calls to a loop.

    This in turn saves memory and improves performance, since memory accesses are often the main thing that makes programs slow nowadays.

    As input, we give GCC a naive, stack-based recursive factorial:

    tail_call.c

    #include <stdio.h>
    #include <stdlib.h>
    
    unsigned factorial(unsigned n) {
        if (n == 1) {
            return 1;
        }
        return n * factorial(n - 1);
    }
    
    int main(int argc, char **argv) {
        int input;
        if (argc > 1) {
            input = strtoul(argv[1], NULL, 0);
        } else {
            input = 5;
        }
        printf("%u\n", factorial(input));
        return EXIT_SUCCESS;
    }
    

    GitHub upstream.

    Compile and disassemble:

    gcc -O1 -foptimize-sibling-calls -ggdb3 -std=c99 -Wall -Wextra -Wpedantic \
      -o tail_call.out tail_call.c
    objdump -d tail_call.out
    

    where -foptimize-sibling-calls is GCC's name for a generalization of tail call optimization, according to man gcc:

       -foptimize-sibling-calls
           Optimize sibling and tail recursive calls.
    
           Enabled at levels -O2, -O3, -Os.
    

    as mentioned at: How do I check if gcc is performing tail-recursion optimization?

    I choose -O1 because:

    • the optimization is not done with -O0. I suspect that this is because there are required intermediate transformations missing.
    • -O3 produces ungodly efficient code that would not be very instructive, although it is also tail call optimized.

    Disassembly with -fno-optimize-sibling-calls:

    0000000000001145 <factorial>:
        1145:       89 f8                   mov    %edi,%eax
        1147:       83 ff 01                cmp    $0x1,%edi
        114a:       74 10                   je     115c <factorial+0x17>
        114c:       53                      push   %rbx
        114d:       89 fb                   mov    %edi,%ebx
        114f:       8d 7f ff                lea    -0x1(%rdi),%edi
        1152:       e8 ee ff ff ff          callq  1145 <factorial>
        1157:       0f af c3                imul   %ebx,%eax
        115a:       5b                      pop    %rbx
        115b:       c3                      retq
        115c:       c3                      retq
    

    With -foptimize-sibling-calls:

    0000000000001145 <factorial>:
        1145:       b8 01 00 00 00          mov    $0x1,%eax
        114a:       83 ff 01                cmp    $0x1,%edi
        114d:       74 0e                   je     115d <factorial+0x18>
        114f:       8d 57 ff                lea    -0x1(%rdi),%edx
        1152:       0f af c7                imul   %edi,%eax
        1155:       89 d7                   mov    %edx,%edi
        1157:       83 fa 01                cmp    $0x1,%edx
        115a:       75 f3                   jne    114f <factorial+0xa>
        115c:       c3                      retq
        115d:       89 f8                   mov    %edi,%eax
        115f:       c3                      retq
    

    The key difference between the two is that:

    • the -fno-optimize-sibling-calls version uses callq, which is the typical non-optimized function call.

      This instruction pushes the return address onto the stack, so the stack grows with every recursive call.

      Furthermore, this version also does push %rbx, which pushes %rbx to the stack.

      GCC does this because it stores edi, which is the first function argument (n) into ebx, then calls factorial.

      GCC needs to do this because it is preparing for another call to factorial, which will use the new edi == n-1.

      It chooses ebx because this register is callee-saved (see: What registers are preserved through a linux x86-64 function call), so the subcall to factorial won't change it and lose n.

    • the -foptimize-sibling-calls version does not use any instructions that push to the stack: it only does goto-style jumps within factorial with the instructions je and jne.

      Therefore, this version is equivalent to a while loop, without any function calls. Stack usage is constant.

    Tested in Ubuntu 18.10, GCC 8.2.
