This question may sound fairly elementary, but this is a debate I had with another developer I work with.
I was taking care to stack allocate things where I could, i
It has been mentioned before that stack allocation is simply moving the stack pointer, that is, a single instruction on most architectures. Compare that to what generally happens in the case of heap allocation.
The operating system maintains portions of free memory as a linked list with the payload data consisting of the pointer to the starting address of the free portion and the size of the free portion. To allocate X bytes of memory, the link list is traversed and each note is visited in sequence, checking to see if its size is at least X. When a portion with size P >= X is found, P is split into two parts with sizes X and P-X. The linked list is updated and the pointer to the first part is returned.
As you can see, heap allocation depends on may factors like how much memory you are requesting, how fragmented the memory is and so on.
Honestly, it's trivial to write a program to compare the performance:
#include <ctime>
#include <iostream>
namespace {
class empty { }; // even empty classes take up 1 byte of space, minimum
}
int main()
{
std::clock_t start = std::clock();
for (int i = 0; i < 100000; ++i)
empty e;
std::clock_t duration = std::clock() - start;
std::cout << "stack allocation took " << duration << " clock ticks\n";
start = std::clock();
for (int i = 0; i < 100000; ++i) {
empty* e = new empty;
delete e;
};
duration = std::clock() - start;
std::cout << "heap allocation took " << duration << " clock ticks\n";
}
It's said that a foolish consistency is the hobgoblin of little minds. Apparently optimizing compilers are the hobgoblins of many programmers' minds. This discussion used to be at the bottom of the answer, but people apparently can't be bothered to read that far, so I'm moving it up here to avoid getting questions that I've already answered.
An optimizing compiler may notice that this code does nothing, and may optimize it all away. It is the optimizer's job to do stuff like that, and fighting the optimizer is a fool's errand.
I would recommend compiling this code with optimization turned off because there is no good way to fool every optimizer currently in use or that will be in use in the future.
Anybody who turns the optimizer on and then complains about fighting it should be subject to public ridicule.
If I cared about nanosecond precision I wouldn't use std::clock()
. If I wanted to publish the results as a doctoral thesis I would make a bigger deal about this, and I would probably compare GCC, Tendra/Ten15, LLVM, Watcom, Borland, Visual C++, Digital Mars, ICC and other compilers. As it is, heap allocation takes hundreds of times longer than stack allocation, and I don't see anything useful about investigating the question any further.
The optimizer has a mission to get rid of the code I'm testing. I don't see any reason to tell the optimizer to run and then try to fool the optimizer into not actually optimizing. But if I saw value in doing that, I would do one or more of the following:
Add a data member to empty
, and access that data member in the loop; but if I only ever read from the data member the optimizer can do constant folding and remove the loop; if I only ever write to the data member, the optimizer may skip all but the very last iteration of the loop. Additionally, the question wasn't "stack allocation and data access vs. heap allocation and data access."
Declare e
volatile
, but volatile is often compiled incorrectly (PDF).
Take the address of e
inside the loop (and maybe assign it to a variable that is declared extern
and defined in another file). But even in this case, the compiler may notice that -- on the stack at least -- e
will always be allocated at the same memory address, and then do constant folding like in (1) above. I get all iterations of the loop, but the object is never actually allocated.
Beyond the obvious, this test is flawed in that it measures both allocation and deallocation, and the original question didn't ask about deallocation. Of course variables allocated on the stack are automatically deallocated at the end of their scope, so not calling delete
would (1) skew the numbers (stack deallocation is included in the numbers about stack allocation, so it's only fair to measure heap deallocation) and (2) cause a pretty bad memory leak, unless we keep a reference to the new pointer and call delete
after we've got our time measurement.
On my machine, using g++ 3.4.4 on Windows, I get "0 clock ticks" for both stack and heap allocation for anything less than 100000 allocations, and even then I get "0 clock ticks" for stack allocation and "15 clock ticks" for heap allocation. When I measure 10,000,000 allocations, stack allocation takes 31 clock ticks and heap allocation takes 1562 clock ticks.
Yes, an optimizing compiler may elide creating the empty objects. If I understand correctly, it may even elide the whole first loop. When I bumped up the iterations to 10,000,000 stack allocation took 31 clock ticks and heap allocation took 1562 clock ticks. I think it's safe to say that without telling g++ to optimize the executable, g++ did not elide the constructors.
In the years since I wrote this, the preference on Stack Overflow has been to post performance from optimized builds. In general, I think this is correct. However, I still think it's silly to ask the compiler to optimize code when you in fact do not want that code optimized. It strikes me as being very similar to paying extra for valet parking, but refusing to hand over the keys. In this particular case, I don't want the optimizer running.
Using a slightly modified version of the benchmark (to address the valid point that the original program didn't allocate something on the stack each time through the loop) and compiling without optimizations but linking to release libraries (to address the valid point that we don't want to include any slowdown caused by linking to debug libraries):
#include <cstdio>
#include <chrono>
namespace {
void on_stack()
{
int i;
}
void on_heap()
{
int* i = new int;
delete i;
}
}
int main()
{
auto begin = std::chrono::system_clock::now();
for (int i = 0; i < 1000000000; ++i)
on_stack();
auto end = std::chrono::system_clock::now();
std::printf("on_stack took %f seconds\n", std::chrono::duration<double>(end - begin).count());
begin = std::chrono::system_clock::now();
for (int i = 0; i < 1000000000; ++i)
on_heap();
end = std::chrono::system_clock::now();
std::printf("on_heap took %f seconds\n", std::chrono::duration<double>(end - begin).count());
return 0;
}
displays:
on_stack took 2.070003 seconds
on_heap took 57.980081 seconds
on my system when compiled with the command line cl foo.cc /Od /MT /EHsc
.
You may not agree with my approach to getting a non-optimized build. That's fine: feel free modify the benchmark as much as you want. When I turn on optimization, I get:
on_stack took 0.000000 seconds
on_heap took 51.608723 seconds
Not because stack allocation is actually instantaneous but because any half-decent compiler can notice that on_stack
doesn't do anything useful and can be optimized away. GCC on my Linux laptop also notices that on_heap
doesn't do anything useful, and optimizes it away as well:
on_stack took 0.000003 seconds
on_heap took 0.000002 seconds
I'd like to say actually code generate by GCC (I remember VS also) doesn't have overhead to do stack allocation.
Say for following function:
int f(int i)
{
if (i > 0)
{
int array[1000];
}
}
Following is the code generate:
__Z1fi:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
subq $**3880**, %rsp <--- here we have the array allocated, even the if doesn't excited.
Ltmp2:
movl %edi, -4(%rbp)
movl -8(%rbp), %eax
addq $3880, %rsp
popq %rbp
ret
Leh_func_end1:
So whatevery how much local variable you have (even inside if or switch), just the 3880 will change to another value. Unless you didn't have local variable, this instruction just need to execute. So allocate local variable doesn't have overhead.
Remark that the considerations are typically not about speed and performance when choosing stack versus heap allocation. The stack acts like a stack, which means it is well suited for pushing blocks and popping them again, last in, first out. Execution of procedures is also stack-like, last procedure entered is first to be exited. In most programming languages, all the variables needed in a procedure will only be visible during the procedure's execution, thus they are pushed upon entering a procedure and popped off the stack upon exit or return.
Now for an example where the stack cannot be used:
Proc P
{
pointer x;
Proc S
{
pointer y;
y = allocate_some_data();
x = y;
}
}
If you allocate some memory in procedure S and put it on the stack and then exit S, the allocated data will be popped off the stack. But the variable x in P also pointed to that data, so x is now pointing to some place underneath the stack pointer (assume stack grows downwards) with an unknown content. The content might still be there if the stack pointer is just moved up without clearing the data beneath it, but if you start allocating new data on the stack, the pointer x might actually point to that new data instead.
Never do premature assumption as other application code and usage can impact your function. So looking at function is isolation is of no use.
If you are serious with application then VTune it or use any similar profiling tool and look at hotspots.
Ketan
Usually stack allocation just consists of subtracting from the stack pointer register. This is tons faster than searching a heap.
Sometimes stack allocation requires adding a page(s) of virtual memory. Adding a new page of zeroed memory doesn't require reading a page from disk, so usually this is still going to be tons faster than searching a heap (especially if part of the heap was paged out too). In a rare situation, and you could construct such an example, enough space just happens to be available in part of the heap which is already in RAM, but allocating a new page for the stack has to wait for some other page to get written out to disk. In that rare situation, the heap is faster.