Perf tool stat output: multiplex and scaling of “cycles”

Question

I am trying to understand the multiplexing and scaling of the "cycles" event in the "perf stat" output.

The following is the output of the perf tool:

 144094.487583      task-clock (msec)         #    1.017 CPUs utilized
  539912613776      instructions              #    1.09  insn per cycle           (83.42%)
  496622866196      cycles                    #    3.447 GHz                      (83.48%)
     340952514      cache-misses              #   10.354 % of all cache refs      (83.32%)
    3292972064      cache-references          #   22.854 M/sec                    (83.26%)
 144081.898558      cpu-clock (msec)          #    1.017 CPUs utilized
       4189372      page-faults               #    0.029 M/sec
             0      major-faults              #    0.000 K/sec
       4189372      minor-faults              #    0.029 M/sec
    8614431755      L1-dcache-load-misses     #    5.52% of all L1-dcache hits    (83.28%)
  156079653667      L1-dcache-loads           # 1083.223 M/sec                    (66.77%)

 141.622640316 seconds time elapsed

I understand that the kernel uses multiplexing to give each event a chance to access the hardware, and hence the final output is an estimate.

The "cycles" event shows (83.48%). I am trying to understand how this number was derived.

I am running "perf" on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz.

Answer 1:

Peter Cordes' answer is on the right track.

PMU events are quite complicated: the number of counters is limited, some events are special, some logical events may be composed of multiple hardware events, and there may even be conflicts between events.

I believe Linux isn't aware of these limitations; it just tries to activate events (to be more precise, event groups) from the list. If it cannot activate all of them, it stops and enables multiplexing. Whenever the multiplexing timer expires, it rotates the list of events, so that activation effectively starts with the second one, then the third, and so on. Linux doesn't know that it could still activate the cycles event because it's special.
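To make the rotation idea concrete, here is a toy simulation in Python. It is only a sketch of the behaviour described above, not the kernel's actual scheduler: it assumes every event needs exactly one counter, that any event fits any counter, and that there are 4 counters (a made-up number for illustration). The names `simulate_rotation` and `num_counters` are hypothetical.

```python
from collections import deque


def simulate_rotation(events, num_counters, ticks):
    """Toy model of event multiplexing (NOT the real kernel algorithm).

    On every tick, events are activated from the head of the list until
    the counters run out; then the list is rotated by one position.
    Returns the fraction of ticks each event was actually counting.
    """
    queue = deque(events)
    running_ticks = {name: 0 for name in events}
    for _ in range(ticks):
        # Activate as many events as there are counters, starting at the head.
        for name in list(queue)[:num_counters]:
            running_ticks[name] += 1
        # Rotate: the head moves to the tail, so the next tick starts
        # activation with what used to be the second event.
        queue.rotate(-1)
    return {name: running_ticks[name] / ticks for name in events}


events = [
    "cycles",
    "instructions",
    "cache-misses",
    "cache-references",
    "L1-dcache-load-misses",
    "L1-dcache-loads",
]
# num_counters=4 is an assumption for illustration only.
fractions = simulate_rotation(events, num_counters=4, ticks=600)
for name, frac in fractions.items():
    print(f"{name:24s} {frac:.2%}")
```

In this simplified model every event ends up counting for 4/6 of the time. Real output is less uniform because events have different constraints, and because some of them (such as cycles and instructions) could use dedicated counters.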

There is a sparsely documented option to pin certain events, which gives them priority: add :D after the event name. Example on my system:

$ perf stat -e cycles -e instructions -e cache-misses -e cache-references -e  L1-dcache-load-misses -e L1-dcache-loads ...

   119.444.297.774      cycles:u                                                      (55,88%)
   130.133.371.858      instructions:u            #    1,09  insn per cycle                                              (67,81%)
        38.277.984      cache-misses:u            #    7,780 % of all cache refs      (72,92%)
       491.979.655      cache-references:u                                            (77,00%)
     3.892.617.942      L1-dcache-load-misses:u   #   15,57% of all L1-dcache hits    (82,19%)
    25.004.563.072      L1-dcache-loads:u                                             (43,85%)

Pinning instructions and cycles:

$ perf stat -e cycles:D -e instructions:D -e cache-misses -e cache-references -e  L1-dcache-load-misses -e L1-dcache-loads ...
   120.683.697.083      cycles:Du                                                   
   132.185.743.504      instructions:Du           #    1,10  insn per cycle                                            
        27.917.126      cache-misses:u            #    4,874 % of all cache refs      (61,14%)
       572.718.930      cache-references:u                                            (71,05%)
     3.942.313.927      L1-dcache-load-misses:u   #   15,39% of all L1-dcache hits    (80,38%)
    25.613.635.647      L1-dcache-loads:u                                             (51,37%)

This results in the same multiplexing as omitting cycles and instructions altogether:

$ perf stat -e cache-misses -e cache-references -e  L1-dcache-load-misses -e L1-dcache-loads ...

    35.333.318      cache-misses:u            #    7,212 % of all cache refs      (62,44%)
   489.922.212      cache-references:u                                            (73,87%)
 3.990.504.529      L1-dcache-load-misses:u   #   15,40% of all L1-dcache hits    (84,99%)
25.918.321.845      L1-dcache-loads:u

Note that you can also group events (-e \{event1,event2\}), which means the events are always read together, or not at all if the combination cannot be activated together.

[1] There is an exception for software events, which can always be added. The relevant parts of the kernel code are in kernel/events/core.c.

Answer 2:

IDK why there's any multiplexing at all for cycles or instructions, because there are dedicated counters for those 2 events on your CPU, which can't be programmed to count anything else.

But for the others, I'm pretty sure the percentages give the fraction of CPU time during which a hardware counter was actually counting that event.

e.g. cache-references was counted for 83.26% of the 144094.487583 CPU-milliseconds your program was running for, or ~119973.07 ms. The total count is extrapolated from the time it was counting.
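The arithmetic above can be sketched in a few lines of Python. As far as I know, perf scales a multiplexed count as raw_count * time_enabled / time_running, and the percentage it prints is time_running / time_enabled. The sketch below uses task-clock as a stand-in for the event's enabled time, which is an approximation taken from this answer, not something perf reports per event. The function names are made up for illustration.

```python
def running_time_ms(enabled_ms, percent_running):
    """Time the event was actually on a hardware counter."""
    return enabled_ms * percent_running / 100.0


def scale_count(raw_count, time_enabled, time_running):
    """Extrapolate a raw count to the full enabled time.

    Returns None if the event never ran, since there is nothing to scale.
    """
    if time_running == 0:
        return None
    return raw_count * time_enabled / time_running


# Numbers from the question; task-clock is used as an approximation
# of the enabled time for cache-references (an assumption).
task_clock_ms = 144094.487583
running_ms = running_time_ms(task_clock_ms, 83.26)
print(f"counted for about {running_ms:.2f} ms")

# perf prints the already-scaled value. Working backwards gives a rough
# idea of how many events were really observed while counting.
reported = 3292972064
raw_estimate = reported * running_ms / task_clock_ms
print(f"roughly {raw_estimate:.0f} events actually observed")

# Scaling the raw estimate forward again recovers the reported value.
rescaled = scale_count(raw_estimate, task_clock_ms, running_ms)
```

The extrapolation assumes the event rate while the counter was active is representative of the whole run, which is why a low percentage means a less trustworthy number.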

Source: https://stackoverflow.com/questions/48414787/perf-tool-stat-output-multiplex-and-scaling-of-cycles

Tags

Linux

linux-kernel

intel

perf

intel-pmu