I am adding usage of a small library to a large existing piece of software and would like to analyze (in finder detail than just in&out rdtsc() or gettimeofday calls) the overhead and it's attribution of the small library. Using things like rdtsc() I can get a sense of the latency that calling my libraries functions have, but I cannot do latency attribution unless I am also able to see whether branches are not being predicted well, caching isnt working properly, etc..I looked into PAPI as I imagined looking at a certain hardware events going into and out of a routine in my library within the context of the bigger binary but it seems I would need a specific kernel module for PAPI to work for me (Linux 2.6.18 && Intel Xeon 5570)...there is Vtune which is specifically geared for intel processors but it seems like it's something which would profile the entire binary for performance and not specific code snippets (the 3-4 calls into my library).
Is there a way for me to use Vtune for my goal, or possibly something which can give me access to such counters without having to patch my kernel?
It's not possible to define an entry point in vtune to start recording.
However, what you can do, is to start the trace without recording and then when you expect to hit your library, you start the trace and let it record the calls. After the calls, you may stop it again and can now look up the library call using the top-bottom tab in vtune.
With it, you should be able to see all the information regarding the calls, and the time spent in each.
If you want to be sure that you only trace while the calls are active, you may start the application under gdb and insert breakpoints when accessing and leaving the functions you wish to examine and then start and stop the profiler appropriately.
Matias is right - you can start the profiling paused ("Start paused" in VTune-speak) and then in your program use __itt_pause / __itt_resume API from the VTune API to limit the data collection to the code region of interest.
You may also want to set the "Target duration type" to "Under one minute" in the project properties - this makes the sampling more fine grained (10 KHz instead of default 1 KHz frequency). Or manually adjust the Sample After value in the list of events to collect. The latter is often more useful when you want to profile something specific like mispredicted branches.
来源:https://stackoverflow.com/questions/11750297/is-it-possible-to-use-vtune-on-certain-code-snippets-in-a-binary-and-not-an-enti