The Problem
We have a mid-sized program for a simulation task that we need to optimize. We have already done our best optimizing the source to the limi
I would recommend looking at which operations constitute the heavy lifting, and then looking for an optimized library that covers them. There are quite a lot of fast, assembly-optimized, SIMD-vectorized libraries out there for common problems (mostly math). Reinventing the wheel is often tempting, but it is usually not worth the effort if an existing solution can cover your needs. Since you have not stated what sort of simulation it is, I can only provide some examples:
http://www.yeppp.info/
http://eigen.tuxfamily.org/index.php?title=Main_Page
https://github.com/xianyi/OpenBLAS
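
To make the idea concrete, here is a minimal sketch that assumes (purely for illustration, since the question does not say) that the hot spot is a dense matrix product. The function name `matmul` and the `USE_CBLAS` macro are hypothetical names of my own. With the macro defined, the work is handed to `cblas_dgemm`, the standard CBLAS routine that OpenBLAS provides; without it, a plain hand-written loop is used, so the two can be compared on your own data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef USE_CBLAS          // hypothetical switch, not part of any library
#include <cblas.h>        // CBLAS header, shipped with OpenBLAS
#endif

// Computes C = A * B for row-major matrices stored in flat vectors.
// A is m x k, B is k x n, and the returned C is m x n.
std::vector<double> matmul(const std::vector<double>& a,
                           const std::vector<double>& b,
                           std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<double> c(m * n, 0.0);
#ifdef USE_CBLAS
    // Library path: C = 1.0 * A * B + 0.0 * C.
    // For row-major, non-transposed operands the leading dimensions
    // are the column counts: k for A, n for B, n for C.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                static_cast<int>(m), static_cast<int>(n), static_cast<int>(k),
                1.0, a.data(), static_cast<int>(k),
                b.data(), static_cast<int>(n),
                0.0, c.data(), static_cast<int>(n));
#else
    // Hand-written fallback: i-p-j loop order so the inner loop walks
    // B and C contiguously in memory.
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const double aip = a[i * k + p];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aip * b[p * n + j];
            }
        }
    }
#endif
    return c;
}
```

Calling `matmul(a, b, m, k, n)` from your own code works the same way on either path. Building with something like `g++ -O2 -DUSE_CBLAS file.cpp -lopenblas` should select the library version, though the exact flags depend on how OpenBLAS is installed on your system. Whether the library wins, and by how much, depends on the matrix sizes and your hardware, so measure before committing to it.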