I am just wondering if there are any useful tools out there that allow me to exploit instruction-level parallelism (ILP) in some algorithms. More specifically, I have a subset of
The problem is that deciding whether an instruction will be executed in parallel is quite difficult considering how many different processor types there are. A good understanding of the CPU architecture you are targeting will give you a good starting point for doing this sort of work. No software will beat a human mind with the right knowledge.
In general, though, the compiler and hardware features such as out-of-order execution engines do so much of this work that it is abstracted away from you as much as possible. Even if you understand it fully, you are unlikely to get more than a few percent speed improvement.
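To make the idea of instruction-level parallelism concrete, here is a minimal sketch in C of one common hand-tuning trick: splitting a reduction across several independent accumulators so the additions no longer form a single dependency chain. The function names (sum_naive, sum_ilp4) and the choice of four accumulators are assumptions made for illustration. Whether it helps at all depends on the CPU, the compiler and its flags, so measure before relying on it.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: each addition depends on the result of the previous one,
   so the adds form one long dependency chain. */
double sum_naive(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical variant with four independent accumulators: the four
   adds in one iteration do not depend on each other, so an
   out-of-order core is free to overlap them. */
double sum_ilp4(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two functions add the elements in a different order, so with floating-point data the results can differ slightly in rounding. That is also why a compiler will usually not make this transformation for you unless you allow it to relax floating-point semantics.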
If you want to see serious speed improvements, you are far better off rewriting the algorithm to take advantage of multiple processors and the available SIMD operations. SIMD alone can give serious speed improvements, especially for the many "multimedia algorithms" that can process multiple elements of the data simultaneously.
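As a rough illustration of the SIMD point, the sketch below adds two float arrays four elements at a time using SSE intrinsics, with a plain scalar loop as the fallback and for the tail. The function name add_arrays is hypothetical, and the code assumes a GCC- or Clang-style compiler that defines __SSE__ on x86 targets; on other targets it simply compiles to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));
    size_t i = 0;
#if defined(__SSE__)
    /* Process four floats per iteration; the unaligned load/store
       variants are used so the arrays need no special alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop: handles the tail, or everything when SSE is unavailable. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

For a loop this simple, an optimizing compiler may well generate equivalent vector code from the scalar version on its own. Hand-written intrinsics tend to pay off on less regular loops that the compiler cannot vectorize automatically.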