Most of my optimisation experience has been with numerical computation in C and C++ on Linux. One thing I found is that profiling, while useful, can skew your run-time results: cheap, frequently called operations (like C++ iterator increments) can look far more expensive than they really are, because the instrumentation overhead dominates them. So take those numbers with a grain of salt.
The strategies that actually produced a good speedup for me were:
- Use numerical arrays rather than arrays of objects. For example, C++ has a `std::complex` type; operations on an array of these were a lot slower than the same operation on two parallel arrays of floats. This generalises to "use the machine types" in any performance-critical code.
- Write the code so that the compiler can be more effective in its optimisations. For example, if you have a fixed-size array, use plain array indices so that auto-vectorisation (a feature of Intel's compiler, and these days of GCC and Clang too) can kick in.
- SIMD instructions can provide a good speedup if your problem fits the kind of domain they are designed for (multiplying/dividing many floats or ints at the same time). On x86 this is the MMX, SSE, SSE2, etc. family.
- Use lookup tables with interpolation for expensive functions where exact values are not important. This is not always a win: looking the data up in memory can be expensive in its own right, since a cache miss may cost more than the computation you avoided.
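To illustrate the first point, here is a minimal sketch (function names are my own) contrasting an array of `std::complex<float>` with two parallel float arrays; the second layout keeps the real and imaginary parts contiguous, which compilers vectorise more readily:

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Array-of-objects layout: real and imaginary parts are interleaved
// in memory, one std::complex<float> per element.
std::vector<std::complex<float>> scale_aos(float scale, std::size_t n) {
    std::vector<std::complex<float>> v(n, std::complex<float>{1.0f, 2.0f});
    for (std::size_t i = 0; i < n; ++i)
        v[i] *= scale;
    return v;
}

// Parallel-arrays layout: two plain float arrays. Each loop body is a
// simple float multiply over contiguous data.
void scale_soa(std::vector<float>& re, std::vector<float>& im, float scale) {
    for (std::size_t i = 0; i < re.size(); ++i) {
        re[i] *= scale;
        im[i] *= scale;
    }
}
```

Both compute the same thing; the difference is purely the memory layout the inner loop sees.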
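For the second point, a sketch of the kind of loop auto-vectorisers handle well: a compile-time trip count, plain indices, and no aliasing between the arrays (the `__restrict` qualifier is non-standard but supported by GCC, Clang, and MSVC; the function name is my own):

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;

// Fixed trip count, unit-stride indexed accesses, and __restrict telling
// the compiler the arrays do not overlap: all of this makes it easy for
// the auto-vectoriser (e.g. at -O3) to emit SIMD code for the loop.
void scale_add(const float* __restrict a, const float* __restrict b,
               float* __restrict out, float k) {
    for (std::size_t i = 0; i < N; ++i)
        out[i] = k * a[i] + b[i];
}
```

Compare the generated assembly with and without `-O3` (or `-ftree-vectorize`) to see whether the loop was actually vectorised.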
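For the SIMD point, here is a hand-written SSE sketch using compiler intrinsics (x86 only; the function name is my own and the length is assumed to be a multiple of four):

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Multiply two float arrays element-wise, four floats per SSE instruction.
// Assumes n is a multiple of 4; unaligned loads/stores are used so the
// arrays need no special alignment.
void mul4(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }
}
```

In practice you would add a scalar tail loop for lengths that are not a multiple of four, or just let the auto-vectoriser generate equivalent code for you.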
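And for the lookup-table point, a sketch of a linearly interpolated sine table (the table size of 256 is an arbitrary tuning knob; larger tables are more accurate but less cache-friendly):

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr double PI = 3.14159265358979323846;
constexpr std::size_t TABLE_SIZE = 256;

// Precomputed sin(x) over [0, 2*pi), read back with linear interpolation.
// Trades a little accuracy for avoiding the libm call per lookup.
struct SinTable {
    std::array<float, TABLE_SIZE + 1> t{};  // +1 so t[i+1] is always valid
    SinTable() {
        for (std::size_t i = 0; i <= TABLE_SIZE; ++i)
            t[i] = float(std::sin(2.0 * PI * double(i) / TABLE_SIZE));
    }
    // x must be in [0, 2*pi)
    float operator()(float x) const {
        float pos = x * float(TABLE_SIZE / (2.0 * PI));
        std::size_t i = std::size_t(pos);
        float frac = pos - float(i);
        return t[i] + frac * (t[i + 1] - t[i]);  // linear interpolation
    }
};
```

With 256 entries the interpolation error stays well below 1e-3, which is plenty for things like audio oscillators, but measure before assuming this beats `std::sin` on your hardware.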
I hope that gives you some inspiration!