Have you used any profiling tool like Intel Vtune analyzer?
What are your recommendations for a C++ multi threaded application on Linux and windows? I am primarily
The Rational PurifyPlus suite includes both a well-proven leak detector and pretty good profiler. I'm not sure if it does go down to the level of cache misses, though - you might need VTune for that.
PurifyPlus is available both on various Unices and Windows so it should cover your requirements, but unfortunately in contrast to Valgrind, it isn't free.