I'm trying to understand how the stack works in Linux. I read the AMD64 ABI sections about the stack and process initialization, and it is not clear how the stack should be mapped. Here
I can deduce from the quotes above that the stack is mapped
That literally just means that memory is allocated. i.e. that there is a logical mapping from those virtual addresses to physical pages. We know this because you can use a push
or call
instruction in _start
without making a system call from user-space to allocate a stack.
In fact the x86-64 System V ABI specifies that argc, argv, and envp are on the stack at process startup.
The question is whether the "main thread"'s stack uses
MAP_GROWSDOWN | MAP_STACK
mapping or maybe even via sbrk
?
The ELF binary loader sets the VM_GROWSDOWN
flag for the main thread's stack, but not the MAP_STACK
flag. This is code inside the kernel, and it does not go through the regular mmap
system call interface.
(Nothing in user-space uses mmap(MAP_GROWSDOWN)
, so normally the main thread's stack is the only mapping that has the VM_GROWSDOWN
flag inside the kernel.)
The internal name of the flag used for the virtual memory area (VMA) of the stack is VM_GROWSDOWN
. In case you're interested, here are all the flags that are used for the main thread's stack: VM_GROWSDOWN
, VM_READ
, VM_WRITE
, VM_MAYREAD
, VM_MAYWRITE
, and VM_MAYEXEC
. In addition, if the ELF binary is specified to have an executable stack (e.g., by compiling with gcc -z execstack
), the VM_EXEC
flag is also used. Note that on architectures that support stacks that grow upwards, VM_GROWSUP
is used instead of VM_GROWSDOWN
if the kernel was compiled with CONFIG_STACK_GROWSUP
defined. The line of code where these flags are specified in the Linux kernel can be found here.
/proc/.../maps
and pmap
don't use the VM_GROWSDOWN
flag; they rely on address comparison instead. Therefore they may not be able to determine the exact range of the virtual address space that the main thread's stack occupies (see an example). On the other hand, /proc/.../smaps
looks for the VM_GROWSDOWN
flag and marks each memory region that has this flag as gd
. (Although it seems to ignore VM_GROWSUP
.)
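As a rough illustration of the smaps behaviour described above, here is a minimal Python sketch that lists every region whose VmFlags line contains gd. The helper name growsdown_regions is made up for this example, and the parser assumes the usual smaps layout (a header line with the address range, followed by "Key: value" lines including VmFlags); treat it as a sketch, not a robust tool.

```python
import re

# Header line of each smaps entry: "start-end perms offset dev inode [name]".
_HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+\S+\s+\S+\s+\S+\s*(.*)$")


def growsdown_regions(smaps_text):
    """Return (start, end, name) for every region whose VmFlags include 'gd'."""
    regions = []
    current = None
    for line in smaps_text.splitlines():
        m = _HEADER.match(line)
        if m:
            # Remember which region the following "Key: value" lines belong to.
            current = (int(m.group(1), 16), int(m.group(2), 16), m.group(4).strip())
        elif line.startswith("VmFlags:") and current is not None:
            if "gd" in line.split()[1:]:
                regions.append(current)
    return regions


if __name__ == "__main__":
    # Linux only: inspect the current process.
    with open("/proc/self/smaps") as f:
        for start, end, name in growsdown_regions(f.read()):
            print(f"{start:#x}-{end:#x} {name or '(anonymous)'}")
```

On a typical process you would expect only the main thread's stack to show up, since nothing else normally carries VM_GROWSDOWN.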
All of these tools/files ignore the MAP_STACK
flag. In fact, the whole Linux kernel ignores this flag (which is probably why the program loader doesn't set it). User-space only passes it for future-proofing, in case the kernel does want to start treating thread-stack allocations specially.
sbrk
makes no sense here; the stack isn't contiguous with the "break", and the brk
heap grows upward toward the stack anyway. Linux puts the stack very near the top of virtual address space. So of course the primary stack couldn't be allocated with (the in-kernel equivalent of) sbrk
.
And no, nothing uses MAP_GROWSDOWN
, not even secondary thread stacks, because it can't in general be used safely.
The mmap(2)
man page, which says MAP_GROWSDOWN
is "used for stacks", is laughably out of date and misleading. See How to mmap the stack for the clone() system call on linux?. As Ulrich Drepper explained in 2008, code using MAP_GROWSDOWN
is typically broken, and he proposed removing the flag from Linux mmap
and from glibc headers. (This obviously didn't happen, but pthreads hasn't used it since well before then, if ever.)
MAP_GROWSDOWN
sets the VM_GROWSDOWN
flag for the mapping inside the kernel. The main thread also uses that flag to enable the growth mechanism, so a thread stack may be able to grow the same way the main stack does: arbitrarily far (up to ulimit -s
?) if the stack pointer is below the page fault location. (Linux does not require "stack probes" to touch every page of a large multi-page stack array or alloca
.)
(Thread stacks are fully allocated up front; only normal lazy allocation of physical pages to back that virtual allocation avoids wasting space for thread stacks.)
A MAP_GROWSDOWN
mapping can also grow the way the mmap
man page describes: access to the "guard page" below the lowest mapped page will also trigger growth, even if that's below the bottom of the red zone.
But the main thread's stack has magic you don't get with mmap(MAP_GROWSDOWN)
. It reserves the growth space up to ulimit -s
to prevent random choice of mmap
address from creating a roadblock to stack growth. That magic is only available to the in-kernel program-loader which maps the main thread's stack during execve()
, making it safe from an mmap(NULL, ...)
randomly blocking future stack growth.
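To make the "growth space up to ulimit -s" idea concrete, here is a small Python sketch that compares the currently mapped [stack] range with the soft RLIMIT_STACK limit. The helper names stack_region and growth_floor are hypothetical, and the arithmetic (top of the mapping minus the limit) is only an approximation: the kernel's real accounting also involves guard gaps and other details not modelled here.

```python
import resource


def stack_region(maps_text):
    """Return (start, end) of the line labelled [stack] in maps text, or None."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[stack]":
            start, end = (int(x, 16) for x in fields[0].split("-"))
            return start, end
    return None


def growth_floor(stack_end, soft_limit):
    """Approximate lowest address the stack could reach; None if unlimited."""
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return stack_end - soft_limit


if __name__ == "__main__":
    # Linux only: look at the current process and its soft stack limit.
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    with open("/proc/self/maps") as f:
        region = stack_region(f.read())
    if region is not None:
        start, end = region
        floor = growth_floor(end, soft)
        print(f"mapped now: {start:#x}-{end:#x} ({(end - start) // 1024} KiB)")
        if floor is None:
            print("stack limit is unlimited")
        else:
            print(f"could grow down to about {floor:#x}")
```

The gap between the mapped start and the computed floor is roughly the space that an mmap(NULL, ...) should not be placed into.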
mmap(MAP_FIXED)
could still create a roadblock for the main stack, but if you use MAP_FIXED
you're 100% responsible for not breaking anything. (Unlimited stack cannot grow beyond the initial 132KiB if MAP_FIXED involved?). MAP_FIXED
will replace existing mappings and reservations, but anything else will treat the main thread's stack-growth space as reserved. (I think that's true; worth trying with MAP_FIXED_NOREPLACE
or just a non-NULL hint address)
See
pthread_create
doesn't use MAP_GROWSDOWN
for thread stacks, and neither should anyone else. Generally, do not use it. Linux pthreads by default allocates the full size for a thread stack. This costs virtual address space but (until it's actually touched) not physical pages.
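The "allocate the full size up front" approach can be sketched as a plain private anonymous mapping with no MAP_GROWSDOWN. This uses Python's mmap module purely for illustration; real thread-stack code is written in C, and the 8 MiB size and the reserve_stack name are assumptions made up for this example.

```python
import mmap

STACK_SIZE = 8 * 1024 * 1024  # assumed size, for illustration only


def reserve_stack(size=STACK_SIZE):
    """Map the whole stack region up front as private anonymous memory."""
    return mmap.mmap(
        -1,  # no backing file: anonymous memory
        size,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )


if __name__ == "__main__":
    region = reserve_stack()
    # Touch only the highest byte, as a downward-growing stack would first.
    region[len(region) - 1] = 1
    print(f"reserved {len(region) // 1024} KiB of address space")
    region.close()
```

The whole range is valid address space from the start; the kernel only needs to back pages with physical memory once they are touched, which is the lazy allocation mentioned above.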
The inconsistent results in comments on Why is MAP_GROWSDOWN mapping does not grow? (some people finding it works, some finding it still segfaults when touching the return value and the page below) sound like https://bugs.centos.org/view.php?id=4767 - MAP_GROWSDOWN
may even be buggy outside of the way the standard main-stack VM_GROWSDOWN
mapping is used.