I am just wondering if there are any useful tools out there that allow me to exploit the instruction-level parallelism (ILP) in some algorithms. More specifically, I have a subset of
First, both the compiler and the CPU itself already aggressively reorder instructions to exploit ILP as well as possible. Most likely, they're doing a better job of it than you'd ever be able to.
However, there are a few areas where a human can aid the process.
The compiler is typically very conservative about reordering floating-point computations, because reordering can slightly change the result. For example, given this code:
float f, g, h, i;
float j = f + g + h + i;
you'll likely get zero ILP, because the code you've written is evaluated as ((f + g) + h) + i: the result of the first addition is an operand of the next addition, whose result is in turn an operand of the final one. No two additions can execute in parallel.
If you instead write it as float j = (f + g) + (h + i), the CPU is able to execute f + g and h + i in parallel, since they don't depend on each other.
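A minimal sketch of the two evaluation orders (the function names are mine, just for illustration):

```cpp
// Serial chain: each addition depends on the result of the previous one.
float sum_serial(float f, float g, float h, float i) {
    return ((f + g) + h) + i;
}

// Paired form: (f + g) and (h + i) have no mutual dependency,
// so an out-of-order CPU can execute the two additions in parallel.
float sum_paired(float f, float g, float h, float i) {
    return (f + g) + (h + i);
}
```

Note that the two versions can produce slightly different results due to floating-point rounding, which is exactly why the compiler won't do this transformation for you by default.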
In general, the thing preventing ILP is dependencies. Sometimes they're direct dependencies between arithmetic instructions as above, and sometimes they're store/load dependencies.
Loads and stores take a long time to execute compared to in-register operations, and operations that depend on them have to wait until the load or store has finished.
So storing data in temporaries, which the compiler can keep in registers, can sometimes be used to avoid memory accesses. Likewise, starting loads as soon as possible helps, so that their latency doesn't block the following operations.
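A hypothetical sketch of the temporaries idea (both functions are mine, not from the answer; a smart compiler may already transform the first into the second when it can prove no aliasing):

```cpp
#include <cstddef>

// Accumulating through the output pointer can force a store/load on
// every iteration: each += depends on the memory location.
void sum_through_memory(const float* a, std::size_t n, float* out) {
    *out = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        *out += a[i];  // load *out, add, store *out
}

// Keeping the running sum in a local lets the compiler hold it in a
// register; the store to memory happens once, at the end.
void sum_in_register(const float* a, std::size_t n, float* out) {
    float acc = 0.0f;  // can live in a register for the whole loop
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i];
    *out = acc;
}
```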
The best technique is really to look closely at your code, and work out the dependency chains. Each sequence of operations where each one depends on the result of the previous is a chain of dependencies that can never be executed in parallel. Can this chain be broken up in some way? Perhaps by storing a value in a temporary, or perhaps by recomputing a value instead of waiting for the cached version to be loaded from memory. Perhaps just by placing a few parentheses as in the original floating-point example.
When there are no dependencies, the CPU will schedule operations to execute in parallel. So all you need to do to exploit ILP is to break up long dependency chains.
Of course, that's easier said than done... :)
But if you spend some time with a profiler, and study the assembly output from the compiler, you can sometimes get an impressive speedup from manually optimizing your code to better exploit ILP.
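The chain-breaking idea above can be sketched with a classic trick, splitting one accumulator into two (my own illustration, under the caveat that this reassociates the floating-point sum):

```cpp
#include <cstddef>

// One long chain: every += must wait for the previous result.
float sum_one_chain(const float* a, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Two independent chains: s0 and s1 can advance in parallel, roughly
// halving the length of the critical dependency chain.
float sum_two_chains(const float* a, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];      // chain 1
        s1 += a[i + 1];  // chain 2, independent of chain 1
    }
    if (i < n)           // leftover element when n is odd
        s0 += a[i];
    return s0 + s1;
}
```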
Do you have reason to believe that the compiler is doing a poor job of uncovering ILP? If you work at the algorithm level, the focus should normally be on data parallelism and higher-order optimizations. Optimizing for ILP would be the absolute last step, and is totally tied to how the compiler works. In general, if you can eliminate false data dependencies, a decent compiler should do the rest for you.
Something like Acumem's SlowSpotter may be a help (unless you really need to hand-optimize for ILP, in which case I don't know of a good tool, unless the compiler can emit a good optimization report for you; IIRC the Cray and SGI MIPS compilers could produce reports like that).
If I read you correctly, you are not interested in SIMD or threads, just getting optimal ordering of normal CPU instructions.
The first thing to check is whether your compiler is targeting the correct CPU subtype. The compiler will usually reorder instructions to reduce dependencies from one instruction to another, but it is vital for the compiler to know specifically which version of the CPU you're targeting. (Specifically, older GCC versions sometimes fail to detect recent CPUs and then optimize for i386.)
The second thing you can do is check your compiler's inlining decisions (by looking at the assembly). Inlining small functions in algorithms can increase code size, but it gives the compiler more opportunity to optimize, as multiple calculations can be done in parallel. I often resort to forced inlining.
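Forced inlining might look like this (a sketch, assuming GCC/Clang; the always_inline attribute is a GCC/Clang extension, and MSVC uses __forceinline instead):

```cpp
#if defined(__GNUC__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

// Small helper, forcibly inlined so the compiler can interleave its
// arithmetic with the caller's instead of emitting a call.
FORCE_INLINE float square(float x) {
    return x * x;
}

float kernel(float x, float y) {
    // After inlining, the two multiplies have no dependency on each
    // other and can be scheduled in parallel.
    return square(x) + square(y);
}
```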
Lastly, for Intel CPUs, Intel's own C++ compiler claims to be best at this. They also have the VTune profiler, which can specifically report on efficient use of the ALUs in your program's hotspots.
The previous answers are good. In addition, there is much to learn at Intel's site, and if you have a budget then Intel's tools would be worth looking at.
Intel's articles on Optimization
The problem is that deciding whether an instruction will be executed in parallel is quite difficult considering how many different processor types there are. A good understanding of the CPU architecture you are targeting will give you a good starting point for doing this sort of work. No software will beat a human mind with the right knowledge.
In general, though, so much work is done by the compiler and by things like out-of-order execution engines that all of this is abstracted away from you as much as possible. You will find that even with a full understanding of it, you're unlikely to get more than a few percent speed improvement.
If you want to see serious speed improvements, you are far better off rewriting the algorithm to take advantage of multiple processors and the available SIMD operations. You can see serious speedups from SIMD alone, especially for the many "multimedia algorithms" that can process multiple elements of the data simultaneously.
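A minimal sketch of the kind of loop this answer has in mind (my own example): each iteration is independent, so a compiler can typically auto-vectorize it with SIMD instructions (e.g. at -O3, or with -O2 -ftree-vectorize on GCC).

```cpp
#include <cstddef>

// Classic "multimedia" kernel shape: scale and offset every element.
// No iteration depends on another, so the compiler is free to process
// several elements per instruction with SSE/AVX/NEON.
void scale_and_offset(const float* in, float* out, std::size_t n,
                      float scale, float offset) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * scale + offset;
}
```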