I am just wondering if there are any useful tools out there that allow me to exploit the instruction-level parallelism (ILP) in some algorithms. More specifically, I have a subset of
Do you have reason to believe that the compiler is doing a poor job of uncovering ILP? If you work at the algorithm level, the focus should normally be on data parallelism and higher-order optimizations. Optimizing for ILP would be the very last step and is tightly tied to how the compiler works. In general, if you can eliminate false data dependencies, a decent compiler should do the rest for you.
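To illustrate the false-dependency point, here is a minimal sketch in C. It assumes that possible pointer aliasing is the source of the false dependency, which is only one of several ways such dependencies arise. The function names `scale_add_may_alias` and `scale_add_no_alias` are made up for this example. Whether the second version actually gets scheduled or vectorized better depends on your compiler and flags, so treat it as something to measure rather than a guarantee.

```c
#include <stddef.h>

/* Hypothetical example: without restrict, the compiler has to assume that
 * dst may overlap a or b, so a store to dst[i] could change a value read
 * in a later iteration. That possible dependency limits how freely the
 * iterations can be reordered or overlapped. */
void scale_add_may_alias(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2.0f * a[i] + b[i];
}

/* Same computation, but restrict (C99) promises that the arrays do not
 * overlap. The iterations are then known to be independent. The caller
 * is responsible for keeping that promise. */
void scale_add_no_alias(float *restrict dst, const float *restrict a,
                        const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2.0f * a[i] + b[i];
}
```

Both functions compute the same result when the arrays are distinct; the only difference is what the compiler is allowed to assume. Comparing the generated assembly for the two is a cheap way to see whether the dependency was holding the compiler back.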
Something like Acumem's SlowSpotter may help. If you really need to hand-optimize for ILP, I don't know of a good tool, unless the compiler can produce a good optimization report for you. IIRC the Cray and SGI MIPS compilers could produce reports like that.