Very simply, what is tail-call optimization?
More specifically, what are some small code snippets where it could be applied, and where not, with an explanation of wh
GCC C minimal runnable example with x86 disassembly analysis
Let's see how GCC can automatically do tail call optimizations for us by looking at the generated assembly.
This will serve as an extremely concrete example of what was mentioned in other answers such as https://stackoverflow.com/a/9814654/895245 that the optimization can convert recursive function calls to a loop.
This in turn saves memory and improves performance, since memory accesses are often the main thing that makes programs slow nowadays.
As an input, we give GCC a non-optimized naive stack based factorial:
tail_call.c
#include <stdio.h>
#include <stdlib.h>

unsigned factorial(unsigned n) {
    if (n == 1) {
        return 1;
    }
    return n * factorial(n - 1);
}

int main(int argc, char **argv) {
    int input;
    if (argc > 1) {
        input = strtoul(argv[1], NULL, 0);
    } else {
        input = 5;
    }
    printf("%u\n", factorial(input));
    return EXIT_SUCCESS;
}
GitHub upstream.
Compile and disassemble:
gcc -O1 -foptimize-sibling-calls -ggdb3 -std=c99 -Wall -Wextra -Wpedantic \
-o tail_call.out tail_call.c
objdump -d tail_call.out
where -foptimize-sibling-calls is the name of a generalization of tail calls, according to man gcc:
-foptimize-sibling-calls
Optimize sibling and tail recursive calls.
Enabled at levels -O2, -O3, -Os.
as mentioned at: How do I check if gcc is performing tail-recursion optimization?
I chose -O1 because:
with -O0, the optimization is not done. I suspect that this is because required intermediate transformations are missing.
-O3 produces ungodly efficient code that would not be very educative, although it is also tail call optimized.
Disassembly with -fno-optimize-sibling-calls:
0000000000001145 <factorial>:
    1145:       89 f8                   mov    %edi,%eax
    1147:       83 ff 01                cmp    $0x1,%edi
    114a:       74 10                   je     115c <factorial+0x17>
    114c:       53                      push   %rbx
    114d:       89 fb                   mov    %edi,%ebx
    114f:       8d 7f ff                lea    -0x1(%rdi),%edi
    1152:       e8 ee ff ff ff          callq  1145 <factorial>
    1157:       0f af c3                imul   %ebx,%eax
    115a:       5b                      pop    %rbx
    115b:       c3                      retq
    115c:       c3                      retq
With -foptimize-sibling-calls:
0000000000001145 <factorial>:
    1145:       b8 01 00 00 00          mov    $0x1,%eax
    114a:       83 ff 01                cmp    $0x1,%edi
    114d:       74 0e                   je     115d <factorial+0x18>
    114f:       8d 57 ff                lea    -0x1(%rdi),%edx
    1152:       0f af c7                imul   %edi,%eax
    1155:       89 d7                   mov    %edx,%edi
    1157:       83 fa 01                cmp    $0x1,%edx
    115a:       75 f3                   jne    114f <factorial+0xa>
    115c:       c3                      retq
    115d:       89 f8                   mov    %edi,%eax
    115f:       c3                      retq
The key difference between the two is that:
the -fno-optimize-sibling-calls version uses callq, which is the typical non-optimized function call.
This instruction pushes the return address onto the stack, so the stack grows with every recursive call.
Furthermore, this version also does push %rbx, which pushes %rbx onto the stack.
GCC does this because it copies edi, which holds the first function argument (n), into ebx, and then calls factorial.
It needs to save n somewhere because it is preparing for another call to factorial, which will use the new edi == n-1.
It chooses ebx because this register is callee-saved (see: What registers are preserved through a linux x86-64 function call), so the subcall to factorial won't change it and lose n.
the -foptimize-sibling-calls version does not use any instructions that push to the stack: it only does goto-style jumps within factorial, using the instructions je and jne.
Therefore, this version is equivalent to a while loop, without any function calls. Stack usage is constant.
Tested in Ubuntu 18.10, GCC 8.2.