Many years ago, C compilers were not particularly smart. As a workaround K&R invented the register keyword, to hint to the compiler, that maybe it woul
Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
for (int i=0; i<numFoo, i++)
{
barOut.munge(foo1, foo2[i]);
}
}
the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
Foo barTemp = barOut;
for (int i=0; i<numFoo, i++)
{
barTemp.munge(foo1, foo2[i]);
}
barOut = barTemp;
}
It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).
In the case of embedded systems and code written in C/C++, I try and avoid dynamic memory allocation as much as possible. The main reason I do this is not necessarily performance but this rule of thumb does have performance implications.
Algorithms used to manage the heap are notoriously slow in some platforms (e.g., vxworks). Even worse, the time that it takes to return from a call to malloc is highly dependent on the current state of the heap. Therefore, any function that calls malloc is going to take a performance hit that cannot be easily accounted for. That performance hit may be minimal if the heap is still clean but after that device runs for a while the heap can become fragmented. The calls are going to take longer and you cannot easily calculate how performance will degrade over time. You cannot really produce a worse case estimate. The optimizer cannot provide you with any help in this case either. To make matters even worse, if the heap becomes too heavily fragmented, the calls will start failing altogether. The solution is to use memory pools (e.g., glib slices ) instead of the heap. The allocation calls are going to be much faster and deterministic if you do it right.
One optimization i have used in C++ is creating a constructor that does nothing. One must manually call an init() in order to put the object into a working state.
This has benefit in the case where I need a large vector of these classes.
I call reserve() to allocate the space for the vector, but the constructor does not actually touch the page of memory the object is on. So I have spent some address space, but not actually consumed a lot of physical memory. I avoid the page faults associated the associated construction costs.
As i generate objects to fill the vector, I set them using init(). This limits my total page faults, and avoids the need to resize() the vector while filling it.
The order you traverse memory can have profound impacts on performance and compilers aren't really good at figuring that out and fixing it. You have to be conscientious of cache locality concerns when you write code if you care about performance. For example two-dimensional arrays in C are allocated in row-major format. Traversing arrays in column major format will tend to make you have more cache misses and make your program more memory bound than processor bound:
#define N 1000000;
int matrix[N][N] = { ... };
//awesomely fast
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[i][j];
}
}
//painfully slow
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[j][i];
}
}
I was reminded of something that I encountered once, where the symptom was simply that we were running out of memory, but the result was substantially increased performance (as well as huge reductions in memory footprint).
The problem in this case was that the software we were using made tons of little allocations. Like, allocating four bytes here, six bytes there, etc. A lot of little objects, too, running in the 8-12 byte range. The problem wasn't so much that the program needed lots of little things, it's that it allocated lots of little things individually, which bloated each allocation out to (on this particular platform) 32 bytes.
Part of the solution was to put together an Alexandrescu-style small object pool, but extend it so I could allocate arrays of small objects as well as individual items. This helped immensely in performance as well since more items fit in the cache at any one time.
The other part of the solution was to replace the rampant use of manually-managed char* members with an SSO (small-string optimization) string. The minimum allocation being 32 bytes, I built a string class that had an embedded 28-character buffer behind a char*, so 95% of our strings didn't need to do an additional allocation (and then I manually replaced almost every appearance of char* in this library with this new class, that was fun or not). This helped a ton with memory fragmentation as well, which then increased the locality of reference for other pointed-to objects, and similarly there were performance gains.
I wrote an optimizing C compiler and here are some very useful things to consider:
Make most functions static. This allows interprocedural constant propagation and alias analysis to do its job, otherwise the compiler needs to presume that the function can be called from outside the translation unit with completely unknown values for the paramters. If you look at the well-known open-source libraries they all mark functions static except the ones that really need to be extern.
If global variables are used, mark them static and constant if possible. If they are initialized once (read-only), it's better to use an initializer list like static const int VAL[] = {1,2,3,4}, otherwise the compiler might not discover that the variables are actually initialized constants and will fail to replace loads from the variable with the constants.
NEVER use a goto to the inside of a loop, the loop will not be recognized anymore by most compilers and none of the most important optimizations will be applied.
Use pointer parameters only if necessary, and mark them restrict if possible. This helps alias analysis a lot because the programmer guarantees there is no alias (the interprocedural alias analysis is usually very primitive). Very small struct objects should be passed by value, not by reference.
Use arrays instead of pointers whenever possible, especially inside loops (a[i]). An array usually offers more information for alias analysis and after some optimizations the same code will be generated anyway (search for loop strength reduction if curious). This also increases the chance for loop-invariant code motion to be applied.
Try to hoist outside the loop calls to large functions or external functions that don't have side-effects (don't depend on the current loop iteration). Small functions are in many cases inlined or converted to intrinsics that are easy to hoist, but large functions might seem for the compiler to have side-effects when they actually don't. Side-effects for external functions are completely unknown, with the exception of some functions from the standard library which are sometimes modeled by some compilers, making loop-invariant code motion possible.
When writing tests with multiple conditions place the most likely one first. if(a || b || c) should be if(b || a || c) if b is more likely to be true than the others. Compilers usually don't know anything about the possible values of the conditions and which branches are taken more (they could be known by using profile information, but few programmers use it).
Using a switch is faster than doing a test like if(a || b || ... || z). Check first if your compiler does this automatically, some do and it's more readable to have the if though.