The Problem
We have a mid-sized program for a simulation task, that we need to optimize. We have already done our best optimizing the source to the limi
relatively new hardware (Intel i5 or i7)
Why not invest in a copy of the Intel compiler and high performance libraries? It can outperform GCC on optimizations by a significant margin, typically from 10% to 30% or even more, and even more so for heavy number-crunching programs. And Intel also provide a number of extensions and libraries for high-performance number-crunching (parallel) applications, if that's something you can afford to integrate into your code. It might payoff big if it ends up saving you months of running time.
We have already done our best optimizing the source to the limit of our programming skills
In my experience, the kind of micro- and nano- optimizations that you typically do with the help of a profiler tend to have a poor return on time-investments compared to macro-optimizations (streamlining the structure of the code) and, most importantly and often overlooked, memory access optimizations (e.g., locality of reference, in-order traversal, minimizing indirection, wielding out cache-misses, etc.). The latter usually involves designing the memory structures to better reflect the way the memory is used (traversed). Sometimes it can be as simple as switching a container type and getting a huge performance boost from that. Often, with profilers, you get lost in the details of the instruction-by-instruction optimizations, and memory layout issues don't show up and are usually missed when forgetting to look at the bigger picture. It's a much better way to invest your time, and the payoffs can be huge (e.g., many O(logN) algorithms end up performing almost as slow as O(N) just because of poor memory layouts (e.g., using a linked-list or linked-tree is a typical culprit of huge performance problems compared to a contiguous storage strategy)).