Benchmarking - How to count number of instructions sent to CPU to find consumed MIPS

问题

Consider I have a software and want to study its behavior using a black-box approach. I have a 3.0GHz CPU with 2 sockets and 4 cores. As you know, in order to find out instructions per second (IPS) we have to use the following formula:

IPS = sockets*(cores/sockets)*clock*(instructions/cycle)

At first, I wanted to find number of instructions per cycle for my specific algorithm. Then I realised its almost impossible to count it using a block-box approach and I need to do in-depth analysis of the algorithm.

But now, I have two question: Regardless of what kind of software is running on my machine and its cpu usage, is there any way to count number of instructions per second sent to the CPU (Millions of instructions per second (MIPS))? And is it possible to find the type of instruction set (add, compare, in, jump, etc) ?

Any piece of script or tool recommendation would be appreciated (in any language).

回答1:

perf stat ./my_program on Linux will use CPU performance counters to record how many instructions it ran, and how many core clock cycles it took. And how much CPU time it used, and will calculate average instructions per core clock cycle for you, e.g.
3,496,129,612 instructions # 2.61 insn per cycle. This is usually more interesting than instructions per second. uops per clock is usually even more interesting in terms of how close you are to maxing out the front-end, though.

But see How to calculate MIPS using perf stat for more detail on instructions / task-clock vs. instructions / elapsed_time if you do actually want total or average MIPS across cores, and counting sleep or not.

For an example output from using it on a tiny microbenchmark loop in a static executable, see Can x86's MOV really be "free"? Why can't I reproduce this at all?

How can I get real-time information at run-time

Do you mean from within the program, to profile only part of it? There's a perf API where you can do perf_event_open or something. Or use a different library for direct access to the HW perf counters.

perf stat is great for microbenchmarking a loop that you've isolated into a stand-alone program that just runs the hot loop for a second or so.

Or maybe you mean something else. perf stat -I 1000 ... ./a.out will print counter values every 1000 ms (1 second), to see how program behaviour changes in real time with whatever time window you want (down to 10ms intervals).

There's also perf record --timestamp to record a timestamp with each event sample. perf report -D might be useful along with this. See http://www.brendangregg.com/perf.html, he mentions something about -T (--timestamp). I haven't really used this; I mostly isolate single loops I'm tuning.

And is it possible to find the type of instruction set (add, compare, in, jump, etc)?

Intel x86 CPUs at least have a counter for branch instructions, but other types aren't differentiated, other than FP instructions. This is probably common to most architectures that have perf counters at all. But with Intel CPUs, there's ocperf.py, a wrapper for perf with symbolic names for more microarchitectural events, so you can

ocperf.py stat -e task_clock,cycles,instructions,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.scalar_double,uops_executed.x87 ./my_program

It's not designed to tell you what instructions are running, you can already tell that from tracing execution. Most instructions are fully pipelined, so the interesting thing is which ports have the most pressure. The exception is the divide/sqrt unit: there's a counter for arith.divider_active: "Cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations". The divider isn't fully pipelined, so a new divps or sqrtps can't always start even if no older uops are ready to execute on port 0. (http://agner.org/optimize/)

Related: linux perf: how to interpret and find hotspots for using perf to identify hotspots. Especially using top-down profiling you have perf sample the call-stack to see which functions make a lot of expensive child calls. (I mention this in case that's what you really wanted to know, rather than instruction mix.)

For exact dynamic instruction counts, you might use an instrumentation tool like Intel PIN, if you're on x86. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.

On recent Intel CPUs, there's HW support for recording which way conditional / indirect branches went, so you can reconstruct exactly which instructions ran in which order, assuming no self-modifying code and that you can still read any JIT buffers. Intel PT.

Sorry I don't know what the equivalents are on AMD CPUs.

来源：https://stackoverflow.com/questions/50019857/benchmarking-how-to-count-number-of-instructions-sent-to-cpu-to-find-consumed

标签

performance

assembly

profiling

cpu

benchmarking