I have a highly threaded program but I believe it is not able to scale well across multiple cores because it is already saturating all the memory bandwidth.
I'd recommend the Visual Studio Sample Profiler which can collect sample events on specific hardware counters. For example, you can choose to sample on cache misses. Here's an article explaining how to choose the CPU counter, though there are other counters you can play with as well.