Question
A lot of literature talks about using inline functions to "avoid the overhead of a function call". However I haven't seen quantifiable data. What is the actual overhead of a function call i.e. what sort of performance increase do we achieve by inlining functions?
Answer 1:
On most architectures, the cost consists of saving all (or some, or none) of the registers to the stack, pushing the function arguments to the stack (or putting them in registers), adjusting the stack pointer and jumping to the beginning of the new code. Then when the function is done, you have to restore the registers from the stack. This webpage has a description of what's involved in the various calling conventions.
Most C++ compilers are smart enough now to inline functions for you. The inline keyword is just a hint to the compiler. Some will even do inlining across translation units where they decide it's helpful.
Answer 2:
There's the technical answer and the practical answer. The practical answer is that it will almost never matter, and in the very rare case that it does, the only way you'll know is through actual, profiled tests.
The technical answer, which your literature refers to, is generally not relevant due to compiler optimizations. But if you're still interested, it is well described by Josh.
As far as a "percentage" goes, you'd have to know how expensive the function itself is. Outside of the cost of the called function there is no percentage, because you are comparing against a zero-cost operation. For inlined code there is no cost; the processor just moves to the next instruction. The downside to inlining is larger code size, which manifests its costs in a different way than the stack construction/tear-down costs.
Answer 3:
I made a simple benchmark against a simple increment function:
inc.c:
typedef unsigned long ulong;

ulong inc(ulong x){
    return x+1;
}
main.c:
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long ulong;

#ifdef EXTERN
ulong inc(ulong);
#else
static inline ulong inc(ulong x){
    return x+1;
}
#endif

int main(int argc, char** argv){
    if (argc < 1+1)
        return 1;
    ulong i, sum = 0, cnt;
    cnt = atoi(argv[1]);
    for(i=0;i<cnt;i++){
        sum+=inc(i);
    }
    printf("%lu\n", sum);
    return 0;
}
Running it with a billion iterations on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz gave me:
- 1.4 seconds for the inlined version
- 4.4 seconds for the regularly linked version
(The timings appear to fluctuate by up to 0.2 s, but I'm too lazy to calculate proper standard deviations, nor do I care about them.)
This suggests that the overhead of a function call on this machine is about 3 nanoseconds ((4.4 s - 1.4 s) / 10^9 calls).
The fastest I've measured anything at is about 0.3 ns, so that would suggest a function call costs about 9 primitive ops, to put it very simplistically.
This overhead increases by about another 2 ns per call (total call time about 6 ns) for functions called through the PLT (i.e. functions in a shared library).
Answer 4:
The amount of overhead will depend on the compiler, CPU, etc. The percentage overhead will depend on the code you're inlining. The only way to know is to take your code and profile it both ways - that's why there's no definitive answer.
Answer 5:
Your question is one of those questions that has no answer one could call the "absolute truth". The overhead of a normal function call depends on three factors:
The CPU. The overhead of function calls varies a lot between x86, PPC, and ARM CPUs, and even if you stay within one architecture, it also varies quite a bit between an Intel Pentium 4, an Intel Core 2 Duo, and an Intel Core i7. It might even differ noticeably between an Intel and an AMD CPU running at the same clock speed, since factors like cache sizes, caching algorithms, memory access patterns and the actual hardware implementation of the call opcode itself can have a huge influence on the overhead.
The ABI (Application Binary Interface). Even on the same CPU, there often exist different ABIs that specify how function calls pass parameters (via registers, via the stack, or via a combination of both) and where and how stack frame initialization and clean-up take place. All of this influences the overhead. Different operating systems may use different ABIs for the same CPU; e.g. Linux, Windows, and Solaris may each use a different ABI for the same CPU.
The Compiler. Strictly following the ABI is only important if functions are called between independent code units, e.g. if an application calls a function of a system library, or a user library calls a function of another user library. As long as functions are "private", i.e. not visible outside a certain library or binary, the compiler may "cheat": it may not strictly follow the ABI, but instead use shortcuts that lead to faster function calls. E.g. it may pass parameters in registers instead of using the stack, or it may skip stack frame setup and clean-up completely if not really necessary.
If you want to know the overhead for a specific combination of the three factors above, e.g. for an Intel Core i5 on Linux using GCC, your only way to get this information is to benchmark the difference between two implementations: one using function calls, and one where you copy the code directly into the caller. This way you force inlining for sure, since the inline keyword is only a hint and does not always lead to inlining.
However, the real question here is: does the exact overhead really matter? One thing is for sure: a function call always has an overhead. It may be small, it may be big, but it certainly exists. And no matter how small it is, if a function is called often enough in a performance-critical section, the overhead will matter to some degree. Inlining rarely makes your code slower, unless you terribly overdo it; it will make the code bigger, though. Today's compilers are pretty good at deciding themselves when to inline and when not to, so you hardly ever have to rack your brain about it.
Personally, I ignore inlining completely during development, until I have a more or less usable product that I can profile. Only if profiling tells me that a certain function is called really often, and within a performance-critical section of the application, will I consider "force-inlining" that function.
So far my answer is very generic; it applies to C as much as it applies to C++ and Objective-C. As a closing word, let me say something about C++ in particular: virtual methods are doubly indirect function calls, which means they have a higher function call overhead than normal function calls, and they also cannot be inlined. Non-virtual methods may or may not be inlined by the compiler, but even when they are not inlined, they are still significantly faster than virtual ones, so you should not make methods virtual unless you really plan to override them or have them overridden.
Answer 6:
For very small functions inlining makes sense, because the (small) cost of the function call is significant relative to the (very small) cost of the function body. For most functions over a few lines it's not a big win.
Answer 7:
It's worth pointing out that an inlined function increases the size of the calling function, and anything that increases the size of a function may have a negative effect on caching. If you're right at a boundary, "just one more wafer-thin mint" of inlined code might have a dramatically negative effect on performance.
If you're reading literature that's warning about "the cost of a function call," I'd suggest it may be older material that doesn't reflect modern processors. Unless you're in the embedded world, the era in which C is a "portable assembly language" has essentially passed. A large amount of the ingenuity of the chip designers in the past decade (say) has gone into all sorts of low-level complexities that can differ radically from the way things worked "back in the day."
Answer 8:
There is a great concept called 'register shadowing', which allows passing values (up to 6?) through registers (on the CPU) instead of the stack (memory). Also, depending on the function and the variables used within it, the compiler may simply decide that frame-management code is not required!
Also, even a C++ compiler may do 'tail call optimization': if A() calls B(), and after calling B() A just returns, the compiler will reuse A's stack frame!
Of course, all this can be done only if the program sticks to the semantics of the standard (see pointer aliasing and its effect on optimizations).
Answer 9:
Modern CPUs are very fast (obviously!). Almost every operation involved in calls and argument passing is a full-speed instruction (indirect calls might be slightly more expensive, mostly the first time through a loop).
Function call overhead is so small that only tight loops that call functions can make it relevant.
Therefore, when we talk about (and measure) function call overhead today, we are usually really talking about the overhead of not being able to hoist common subexpressions out of loops. If a function has to do a bunch of (identical) work every time it is called, the compiler would be able to "hoist" that work out of the loop and do it once if the function were inlined. When it is not inlined, the code will probably just go ahead and repeat the work you told it to do!
Inlined functions can seem impossibly faster, not because of call and argument overhead, but because of common subexpressions that can be hoisted out of the function.
Example:
Foo::result_type MakeMeFaster()
{
    Foo t = 0;
    for (auto i = 0; i < 1000; ++i)
        t += CheckOverhead(SomethingUnpredictible());
    return t.result();
}
Foo CheckOverhead(int i)
{
    auto n = CalculatePi_1000_digits();
    return i * n;
}
An optimizer can see through this foolishness and do:
Foo::result_type MakeMeFaster()
{
    Foo t = 0;
    auto _hidden_optimizer_tmp = CalculatePi_1000_digits();
    for (auto i = 0; i < 1000; ++i)
        t += SomethingUnpredictible() * _hidden_optimizer_tmp;
    return t.result();
}
It seems like the call overhead is impossibly reduced, because the compiler has really hoisted a big chunk of the function out of the loop (the CalculatePi_1000_digits call). The compiler needs to be able to prove that CalculatePi_1000_digits always returns the same result, but good optimizers can do that.
Answer 10:
There are a few issues here.
If you have a smart enough compiler, it will do some automatic inlining for you even if you did not specify inline. On the other hand, there are many things that cannot be inlined.
If the function is virtual, then of course you are going to pay the price that it cannot be inlined because the target is determined at runtime. Conversely, in Java, you might be paying this price unless you indicate that the method is final.
Depending on how your code is organized in memory, you may be paying a cost in cache misses and even page misses as the code is located elsewhere. That can end up having a huge impact in some applications.
Answer 11:
There is not much overhead at all, especially with small (inline-able) functions or even classes.
The following example runs three different tests, each many, many times, and times them. The results always agree to within a few hundredths of a second.
#include <boost/timer/timer.hpp>
#include <iostream>
#include <cmath>

double sum;
double a = 42, b = 53;

//#define ITERATIONS 1000000      // 1 million - for testing
//#define ITERATIONS 10000000000  // 10 billion ~ 10s per run
//#define WORK_UNIT sum += a + b
/* output
8.609619s wall, 8.611255s user + 0.000000s system = 8.611255s CPU (100.0%)
8.604478s wall, 8.611255s user + 0.000000s system = 8.611255s CPU (100.1%)
8.610679s wall, 8.595655s user + 0.000000s system = 8.595655s CPU (99.8%)
9.5e+011 9.5e+011 9.5e+011
*/

#define ITERATIONS 100000000 // 100 million ~ 10s per run
#define WORK_UNIT sum += std::sqrt(a*a + b*b + sum) + std::sin(sum) + std::cos(sum)
/* output
8.485689s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (100.0%)
8.494153s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (99.9%)
8.467291s wall, 8.470854s user + 0.000000s system = 8.470854s CPU (100.0%)
2.50001e+015 2.50001e+015 2.50001e+015
*/

// ------------------------------
double simple()
{
    sum = 0;
    boost::timer::auto_cpu_timer t;
    for (unsigned long long i = 0; i < ITERATIONS; i++)
    {
        WORK_UNIT;
    }
    return sum;
}

// ------------------------------
void call6()
{
    WORK_UNIT;
}
void call5(){ call6(); }
void call4(){ call5(); }
void call3(){ call4(); }
void call2(){ call3(); }
void call1(){ call2(); }

double calls()
{
    sum = 0;
    boost::timer::auto_cpu_timer t;
    for (unsigned long long i = 0; i < ITERATIONS; i++)
    {
        call1();
    }
    return sum;
}

// ------------------------------
class Obj3{
public:
    void runIt(){
        WORK_UNIT;
    }
};

class Obj2{
public:
    Obj2(){ it = new Obj3(); }
    ~Obj2(){ delete it; }
    void runIt(){ it->runIt(); }
    Obj3* it;
};

class Obj1{
public:
    void runIt(){ it.runIt(); }
    Obj2 it;
};

double objects()
{
    sum = 0;
    Obj1 obj;
    boost::timer::auto_cpu_timer t;
    for (unsigned long long i = 0; i < ITERATIONS; i++)
    {
        obj.runIt();
    }
    return sum;
}

// ------------------------------
int main(int argc, char** argv)
{
    double ssum = 0;
    double csum = 0;
    double osum = 0;

    ssum = simple();
    csum = calls();
    osum = objects();

    std::cout << ssum << " " << csum << " " << osum << std::endl;
}
The output for running 100,000,000 iterations (of each type: simple, six nested function calls, three nested object calls) was, with this semi-convoluted work payload:
sum += std::sqrt(a*a + b*b + sum) + std::sin(sum) + std::cos(sum)
as follows:
8.485689s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (100.0%)
8.494153s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (99.9%)
8.467291s wall, 8.470854s user + 0.000000s system = 8.470854s CPU (100.0%)
2.50001e+015 2.50001e+015 2.50001e+015
Using a simple work payload of
sum += a + b
gives the same results, except a couple of orders of magnitude faster in each case.
Answer 12:
Each function call requires a new stack frame to be set up. But the overhead of this would only be noticeable if you are calling a function on every iteration of a loop over a very large number of iterations.
Answer 13:
For most functions, there is no additional overhead for calling them in C++ vs. C (unless you count the "this" pointer as an unnecessary argument to every function; you have to pass state to a function somehow, though)...
For virtual functions, there is an additional level of indirection (equivalent to calling a function through a pointer in C)... But really, on today's hardware this is trivial.
Answer 14:
I don't have any numbers, either, but I'm glad you're asking. Too often I see people try to optimize their code starting with vague ideas of overhead, but not really knowing.
Answer 15:
Depending on how you structure your code, division into units such as modules and libraries may matter, in some cases profoundly.
- Calling a dynamic library function with external linkage will most of the time impose full stack frame processing. That is why using qsort from the C standard library is an order of magnitude (10 times) slower than using STL code when the comparison operation is as simple as an integer comparison.
- Passing function pointers between modules will also be affected.
The same penalty will most likely affect the use of C++ virtual functions, as well as any other functions whose code is defined in separate modules.
The good news is that whole program optimization might resolve the issue for dependencies between static libraries and modules.
Answer 16:
As others have said, you really don't have to worry too much about overhead unless you're going for ultimate performance or something akin to it. When you make a function call, the compiler has to emit code to:
- Save function parameters to the stack
- Save the return address to the stack
- Jump to the starting address of the function
- Allocate space for the function's local variables (stack)
- Run the body of the function
- Save the return value (in a register or on the stack)
- Free the space used for the local variables
- Jump back to the saved return address
- Free up the space used for the parameters, etc.
However, you have to weigh that against the lowered readability of your code, as well as the impact on your testing strategies, maintenance plans, and the overall size of your source files.
Source: https://stackoverflow.com/questions/144993/how-much-overhead-is-there-in-calling-a-function-in-c