The Problem
We have a mid-sized program for a simulation task that we need to optimize. We have already done our best optimizing the source to the limi
If you can afford it, try VTune. It provides much more information than simple sampling (which is what gprof provides, as far as I know). You might also give AMD CodeAnalyst a try. The latter is decent, free software, but it might not work correctly (or at all) with Intel CPUs.
Equipped with such a tool, you can check various measures such as cache utilization (which basically comes down to memory layout). A cache that is used to its full extent provides a huge boost to efficiency.
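To illustrate why memory layout matters for the cache, here is a minimal sketch. The function names are made up for this example, and it assumes the matrix is stored row-major in one flat array. Both functions compute the same sum, but the first walks memory with unit stride while the second jumps `cols` elements on every step, which tends to cause far more cache misses on large matrices. A profiler should show the difference; the actual numbers depend on your machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: sum a rows x cols matrix stored row-major in a
// flat vector, visiting elements in the order they sit in memory.
double sum_row_order(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];  // consecutive addresses
    return total;
}

// Same result, but the inner loop strides by `cols` elements, so
// consecutive accesses are far apart in memory.
double sum_column_order(const std::vector<double>& m,
                        std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];  // stride of `cols` elements
    return total;
}
```

If your hot loops look like the second version, swapping the loop order (or the storage order) is usually the first thing to try.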
When you are sure that your algorithms and data structures are optimal, you should definitely use the multiple cores on the i5 and i7. In other words, play around with different parallel programming algorithms and patterns and see if you can get a speedup.
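One low-effort way to try multiple cores is an OpenMP parallel loop. The sketch below is only an illustration: `total_kinetic_energy` and its parameters are hypothetical, not taken from your program, and it assumes a compiler with OpenMP enabled (for example `-fopenmp` with GCC or Clang). Without that flag the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: sum 0.5 * m * v^2 over all particles.
// Each iteration is independent, so the loop can be split across cores;
// the reduction clause gives every thread a private partial sum and
// combines them at the end.
double total_kinetic_energy(const std::vector<double>& mass,
                            const std::vector<double>& speed) {
    assert(mass.size() == speed.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(mass.size());
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        total += 0.5 * mass[i] * speed[i] * speed[i];
    }
    return total;
}
```

Note that a parallel reduction may add the terms in a different order than the serial loop, so floating-point results can differ in the last digits. Measure before and after; for small arrays the threading overhead can outweigh the gain.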
When you have truly parallel data (array-like structures on which you perform the same or similar operations), you should give OpenCL and SIMD instructions (the latter are easier to set up) a try.
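As a rough idea of what the SIMD route looks like, here is a sketch using SSE intrinsics, which process four floats per instruction. The helper `scaled_add` is invented for this example. It assumes an x86 target with SSE available and falls back to a plain loop elsewhere. Modern compilers can often auto-vectorize a loop this simple on their own, so check the generated code or the profiler before writing intrinsics by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics (x86 only)
#define EXAMPLE_HAVE_SSE 1
#endif

// Hypothetical helper: y[i] += a * x[i] for every element.
void scaled_add(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::size_t i = 0;
#ifdef EXAMPLE_HAVE_SSE
    const __m128 va = _mm_set1_ps(a);          // broadcast a to 4 lanes
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);       // unaligned load of 4 floats
        __m128 vy = _mm_loadu_ps(&y[i]);
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
        _mm_storeu_ps(&y[i], vy);              // write 4 results back
    }
#endif
    // Scalar loop handles the leftover elements (or everything when
    // SSE is not available).
    for (; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

The same pattern carries over to wider instruction sets such as AVX, with more elements per step.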