I am just wondering if there are any useful tools out there that allow me to exploit instruction-level parallelism (ILP) in some algorithms. More specifically, I have a subset of
First, both the compiler and the CPU itself already aggressively reorder instructions to exploit ILP as well as possible. Most likely, they're doing a better job of it than you'd ever be able to.
However, there are a few areas where a human can aid the process.
The compiler is typically very conservative about reordering floating-point computations, because floating-point arithmetic is not associative and reordering might slightly change the result. So, for example, given this code:
float f, g, h, i;
float j = f + g + h + i;
you'll likely get zero ILP, because the code you've written is evaluated as ((f + g) + h) + i: the result of the first addition is used as an operand for the next, the result of which is used as an operand in the final addition. No two additions can execute in parallel.
If you instead write it as float j = (f + g) + (h + i), the CPU is able to execute f+g and h+i in parallel. They don't depend on each other.
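To make the two groupings concrete, here is a minimal sketch that wraps each one in a function. The names sum_chained and sum_paired are hypothetical, chosen only for this illustration. Because floating-point addition is not associative, the two versions can differ in the last bits for some inputs, which is exactly why the compiler will not normally make this change for you.

```cpp
// Hypothetical helpers for illustration; the names are not from the original post.

// Evaluated as ((f + g) + h) + i: each addition needs the result of the
// previous one, so the three additions form a single dependency chain.
float sum_chained(float f, float g, float h, float i) {
    return f + g + h + i;
}

// (f + g) and (h + i) do not depend on each other, so the CPU is free to
// execute them in parallel; only the final addition waits for both.
float sum_paired(float f, float g, float h, float i) {
    return (f + g) + (h + i);
}
```

For small values that are exactly representable, such as 1, 2, 3 and 4, both functions return the same result; the difference only shows up when rounding comes into play.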
In general, the thing preventing ILP is dependencies. Sometimes they're direct dependencies between arithmetic instructions as above, and sometimes they're store/load dependencies.
Loads and stores can take a long time to execute compared to in-register operations, especially on a cache miss, and operations that depend on them have to wait until the load or store has finished.
So storing data in temporaries, which the compiler can keep in registers, can sometimes be used to avoid memory accesses. Likewise, starting loads as early as possible helps, so that their latency doesn't block the operations that follow.
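A rough sketch of the temporary idea, using two hypothetical functions (sum_into_memory and sum_into_local are names invented for this example). In the first, the running total lives behind a pointer that might alias the input, so the compiler may have to load and store it on every iteration. In the second, the total is a local variable that can stay in a register until the end. Whether this makes a measurable difference depends on the compiler and the target, so treat it as something to check in the assembly output rather than a guarantee.

```cpp
#include <cstddef>

// Accumulates directly through the pointer. Since out and data have the same
// element type, the compiler generally cannot assume they do not alias, and
// may reload and store *out on each iteration.
void sum_into_memory(const float* data, std::size_t n, float* out) {
    *out = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        *out += data[k];
    }
}

// Accumulates in a local temporary that can be kept in a register;
// the result is written to memory once, after the loop.
void sum_into_local(const float* data, std::size_t n, float* out) {
    float total = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        total += data[k];
    }
    *out = total;
}
```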
The best technique is really to look closely at your code, and work out the dependency chains. Each sequence of operations where each one depends on the result of the previous is a chain of dependencies that can never be executed in parallel. Can this chain be broken up in some way? Perhaps by storing a value in a temporary, or perhaps by recomputing a value instead of waiting for the cached version to be loaded from memory. Perhaps just by placing a few parentheses as in the original floating-point example.
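As one possible illustration of breaking up a chain, consider summing an array. With a single accumulator, every addition depends on the one before it. Splitting the work across several independent accumulators gives the CPU several shorter chains that it can overlap. The function names and the choice of four accumulators below are assumptions made for this sketch, not a recommendation; the right number depends on the hardware, and the result may differ slightly from the single-accumulator version because the additions happen in a different order.

```cpp
#include <cstddef>

// One long dependency chain: each addition waits for the previous total.
float sum_single(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        total += data[k];
    }
    return total;
}

// Four independent chains (s0..s3). Within one loop iteration the four
// additions do not depend on each other, so they can overlap.
float sum_four_chains(const float* data, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        s0 += data[k];
        s1 += data[k + 1];
        s2 += data[k + 2];
        s3 += data[k + 3];
    }
    // Combine the partial sums pairwise, then handle the leftover elements.
    float total = (s0 + s1) + (s2 + s3);
    for (; k < n; ++k) {
        total += data[k];
    }
    return total;
}
```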
When there are no dependencies, the CPU will schedule operations to execute in parallel. So all you need to do to exploit ILP is to break up long dependency chains.
Of course, that's easier said than done... :)
But if you spend some time with a profiler, and study the assembly output from the compiler, you can sometimes get an impressive speedup from manually optimizing your code to better exploit ILP.