I have a multithreaded C++ program that I would like to profile at the function level to detect inefficiencies such as lock contention. I know Intel VTune does this, and all