I am planning to measure PMU counters for L1, L2, and L3 misses and branch prediction misses. I have read the related Intel documents, but I am unsure about the scenarios below. Could someone
Summary of the Intel forum thread started by the OP:
The Linux `perf` subsystem virtualizes the performance counters, but this means you have to read them with a system call, instead of `rdpmc`, to get the full virtualized 64-bit value instead of whatever is currently in the architectural performance counter register.
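A minimal sketch of the system-call route, assuming Linux with `<linux/perf_event.h>` available. The helper names (`make_hw_attr`, `open_counter`, `start_counter`, `stop_counter`, `read_counter`) are made up for illustration; `perf_event_open`, the `PERF_EVENT_IOC_*` ioctls, and `read()` on the returned descriptor are the real kernel interface.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Describe a hardware event that counts user-space only and starts disabled.
 * config is one of the generic PERF_COUNT_HW_* ids,
 * e.g. PERF_COUNT_HW_BRANCH_MISSES or PERF_COUNT_HW_CACHE_MISSES. */
struct perf_event_attr make_hw_attr(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return attr;
}

/* glibc has no wrapper for perf_event_open, so go through syscall(2).
 * pid = 0, cpu = -1: follow the calling thread onto whichever core it runs on.
 * Returns a file descriptor, or -1 with errno set. */
int open_counter(uint64_t config)
{
    struct perf_event_attr attr = make_hw_attr(config);
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Zero the count and start counting. Returns 0 on success. */
int start_counter(int fd)
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, 0) == -1)
        return -1;
    return ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
}

/* Stop counting. Returns 0 on success. */
int stop_counter(int fd)
{
    return ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
}

/* Fetch the virtualized 64-bit count with a read() system call. */
int read_counter(int fd, uint64_t *count)
{
    return read(fd, count, sizeof(*count)) == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

Typical use would be `open_counter(PERF_COUNT_HW_BRANCH_MISSES)`, then `start_counter(fd)`, the code under test, `stop_counter(fd)`, and `read_counter(fd, &n)`. Opening the event can fail depending on `perf_event_paranoid` or inside a VM, so check for `-1` first.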
If you want to use `rdpmc` inside your own code so it can measure itself, pin each thread to a core because context switches don't save/restore PMCs. There's no easy way to avoid measuring everything that happens on the core, including interrupt handlers and other processes that get a timeslice. This can be a good thing, since you need to take the impact of kernel overhead into account.
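Pinning plus a raw counter read could look roughly like the sketch below. It assumes Linux with glibc and an x86 CPU; `pin_to_cpu`, `first_allowed_cpu`, and the `rdpmc` wrapper are illustrative names, not part of any library. Note that the `rdpmc` instruction faults in user space unless the kernel has enabled it (the CR4.PCE bit), and the counter you read must already have been programmed by something else.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for sched_setaffinity() and the CPU_* macros */
#endif
#include <sched.h>
#include <stdint.h>

/* Pin the calling thread to one logical CPU. Returns 0 on success. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
}

/* Lowest-numbered CPU the calling thread may currently run on, or -1.
 * Useful inside containers, where CPU 0 is not always in the allowed set. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

#if defined(__x86_64__) || defined(__i386__)
/* Read performance counter `index` of the core we are running on.
 * RDPMC takes the counter index in ECX and returns the value in EDX:EAX.
 * Assumption: user-space RDPMC is enabled, otherwise this raises SIGSEGV. */
static inline uint64_t rdpmc(uint32_t index)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return ((uint64_t)hi << 32) | lo;
}
#endif
```

The idea is to call `pin_to_cpu()` once at thread start, then take `rdpmc(i)` before and after the region of interest and subtract. The difference still includes whatever interrupts or other tasks ran on that core in between.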
More useful quotes from John D. McCalpin, PhD ("Dr. Bandwidth"):
For inline code instrumentation you should be able to use the "perf events" API, but the documentation is minimal. Some resources are available at http://web.eece.maine.edu/~vweaver/projects/perf_events/faq.html
You can use "pread()" on the /dev/cpu/*/msr device files to read the MSRs -- this may be a bit easier to read than IOCTL-based code. The codes "rdmsr.c" and "wrmsr.c" from "msr-tools-1.3" provide excellent examples.
There have been a number of approaches to reserving and sharing performance counters, including both software-only and combined hardware+software approaches, but at this point there is not a "standard" approach. (It looks like Intel has a hardware-based approach using MSR 0x392 IA32_PERF_GLOBAL_INUSE, but I don't know what platforms support it.)
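The `pread()` approach from the quotes above can be sketched as follows. This is only an outline under stated assumptions: the `msr` kernel module is loaded, the process has permission to open `/dev/cpu/N/msr` (normally root), and `open_msr` / `read_msr` are hypothetical helper names rather than functions taken from msr-tools.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Open the MSR device file of one logical CPU.
 * Returns a file descriptor, or -1 with errno set. */
int open_msr(int cpu)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    return open(path, O_RDONLY);
}

/* Read one MSR: the register number is used as the file offset and
 * each access transfers exactly 8 bytes. Returns 0 on success, -1 on error. */
int read_msr(int fd, uint32_t reg, uint64_t *value)
{
    ssize_t n = pread(fd, value, sizeof(*value), (off_t)reg);
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

For example, `read_msr(fd, 0x392, &v)` would attempt to read the IA32_PERF_GLOBAL_INUSE register mentioned above; on a platform that does not implement that MSR, expect the read to fail rather than return a value.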
Your questions:
What will happen if my process is scheduled out while my_program() is running, and then scheduled onto another core?
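You'll see random garbage: each core has its own PMCs, so the "before" and "after" readings come from two unrelated counters. The same thing happens if another process resets the PMCs between timeslices of your process.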