intel-pmu | 易学教程

Using PEBS and Linux Perf to Count the number of CPU cycles passed to execute X number of instructions

Question: I want to do something like this: after 100 million instructions have executed, read the Linux perf hardware CPU-cycles counter and record its value in a file. I want to use this code to characterize the performance of applications and benchmark programs during different phases of program execution. I have a hint that I need to set up Intel PEBS so that it overflows after 100 million instructions and then query the Linux perf hardware CPU-cycles counter. Any pointer on where to start and if someone has already
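Once a cycle reading has been logged at each 100-million-instruction boundary, characterizing phases comes down to differencing consecutive readings. The following is a minimal sketch of that post-processing step only, not of the counter setup. It assumes the log holds cumulative cycle counts, one per overflow, from a counter that started at zero. The function names and the sample readings are hypothetical and do not come from a real run.

```python
INSTRUCTIONS_PER_INTERVAL = 100_000_000  # sampling period taken from the question


def cycles_per_interval(cumulative_cycles):
    """Turn cumulative cycle readings into per-interval cycle counts.

    cumulative_cycles[i] is the cycle counter value read at the i-th
    instruction-count overflow. The counter is assumed to start at 0.
    """
    deltas = []
    previous = 0
    for reading in cumulative_cycles:
        deltas.append(reading - previous)
        previous = reading
    return deltas


def cpi_per_interval(cumulative_cycles):
    """Cycles per instruction for each 100M-instruction interval."""
    return [delta / INSTRUCTIONS_PER_INTERVAL
            for delta in cycles_per_interval(cumulative_cycles)]


if __name__ == "__main__":
    # Made-up readings, for illustration only.
    readings = [150_000_000, 270_000_000, 520_000_000]
    for index, cpi in enumerate(cpi_per_interval(readings)):
        print(f"interval {index}: CPI = {cpi:.2f}")
```

A jump in CPI between neighbouring intervals is one simple signal that the program has moved into a different phase.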

How does perf use the offcore events?

Question: Some built-in perf events are mapped to offcore events. For example, LLC-loads and LLC-load-misses are mapped to OFFCORE_RESPONSE events. This can be easily determined as discussed here. However, these offcore events require writing specific values to specific MSRs to select a particular event. perf seems to use an array named something like snb_hw_cache_extra_regs to specify which values to write to which MSRs. I would like to know how this array is used.
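As background for the question, the perf_event_open ABI encodes a generic cache event in a single config value as cache id | (operation << 8) | (result << 16). The sketch below shows how such a value can be split back into three indices and used to look up a per-event extra value, which is the general shape a three-dimensional table like snb_hw_cache_extra_regs suggests. It illustrates the indexing idea only and is not the kernel's code. The table EXTRA_REGS, its placeholder strings, and the function names are hypothetical; the real table holds numeric MSR values that are not reproduced here.

```python
# Enum values from the perf_event_open(2) ABI.
PERF_COUNT_HW_CACHE_LL = 2
PERF_COUNT_HW_CACHE_OP_READ = 0
PERF_COUNT_HW_CACHE_RESULT_ACCESS = 0
PERF_COUNT_HW_CACHE_RESULT_MISS = 1


def cache_config(cache_id, op, result):
    """Pack the three fields of a generic cache event into one config value."""
    return cache_id | (op << 8) | (result << 16)


def decode_cache_config(config):
    """Split a config value back into (cache_id, op, result)."""
    return (config & 0xFF, (config >> 8) & 0xFF, (config >> 16) & 0xFF)


# Hypothetical stand-in for an extra-register table. The strings are
# placeholders, not real OFFCORE_RESPONSE MSR values.
EXTRA_REGS = {
    (PERF_COUNT_HW_CACHE_LL, PERF_COUNT_HW_CACHE_OP_READ,
     PERF_COUNT_HW_CACHE_RESULT_ACCESS): "PLACEHOLDER_LLC_READ_ACCESS_BITS",
    (PERF_COUNT_HW_CACHE_LL, PERF_COUNT_HW_CACHE_OP_READ,
     PERF_COUNT_HW_CACHE_RESULT_MISS): "PLACEHOLDER_LLC_READ_MISS_BITS",
}


def lookup_extra_reg(config):
    """Return the extra value for a cache event, or None if it needs none."""
    return EXTRA_REGS.get(decode_cache_config(config))


if __name__ == "__main__":
    llc_load_misses = cache_config(PERF_COUNT_HW_CACHE_LL,
                                   PERF_COUNT_HW_CACHE_OP_READ,
                                   PERF_COUNT_HW_CACHE_RESULT_MISS)
    print(hex(llc_load_misses), lookup_extra_reg(llc_load_misses))
```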

What is the meaning of IB read, IB write, OB read and OB write. They came as output of Intel® PCM while monitoring PCIe bandwidth

How can I read performance counters from the kernel?

Question: I have been using the Linux perf tool in user space. I want to write code that reads the performance counters for a thread every time it is context-switched. The steps required are: 1) Get a mechanism to read the performance counter registers. 2) Call step 1 from the scheduler after every context switch. I am stuck at step 1, as I could not figure out which functions to call to read the performance registers or how to describe an event when doing so. I tried going through the docs

How to Configure and Sample Intel Performance Counters In-Process

Question: In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system):

    results[] = ...
    for (iteration = 0; iteration < num_iterations; iteration++) {
        pctr_start = sample_pctr();
        the_benchmark();
        pctr_stop = sample_pctr();
        results[iteration] = pctr_stop - pctr_start;
    }

FWIW, the performance counter I am thinking of using is CPU_CLK_UNHALTED.THREAD_ALL, to read the number of core cycles independent of clock frequency changes (In

What causes the DTLB_LOAD_MISSES.WALK_* performance events to occur?

Consider the following loop:

    .loop:
        add rsi, STRIDE
        mov eax, dword [rsi]
        dec ebp
        jg .loop

where STRIDE is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code. That is, the buffer is not initialized or touched before the loop. On Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. I've run this code for all possible strides in the range 0-8192. The measured number of minor and major page faults is exactly 1 and 0, respectively, per page accessed. I've also measured all
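Since the fault counts are stated per page accessed, it helps to know how many distinct 4K pages the loop touches for a given stride. The sketch below models only the address arithmetic of the loop. It assumes the buffer starts on a page boundary and ignores the case where a 4-byte load straddles two pages. The function name and parameters are hypothetical.

```python
PAGE_SIZE = 4096  # 4K pages, as in the question


def pages_touched(stride, iterations, page_size=PAGE_SIZE):
    """Count the distinct pages the loop loads from.

    The loop adds `stride` to the pointer before each load, so the k-th
    load (k = 1..iterations) reads byte offset k * stride from the start
    of the buffer. The buffer is assumed to be page-aligned, and loads
    that cross a page boundary are counted once, by their first byte.
    """
    return len({(k * stride) // page_size
                for k in range(1, iterations + 1)})


if __name__ == "__main__":
    for stride in (0, 64, 4096, 8192):
        print(stride, pages_touched(stride, iterations=1000))
```

With these assumptions, the expected number of minor faults for a run is the value returned by pages_touched for that stride and iteration count.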

Intel PMU event for L1 cache hit event

I'm trying to count the number of cache hits at different cache levels (L1, L2 and L3) for a program on an Intel Haswell processor. I wrote a program to count the number of L2 and L3 cache hits by monitoring the respective events. To achieve that, I checked the Intel x86 Software Developer's Manual and used the cache_all_request and cache_miss events for the L2 and L3 caches. However, I didn't find the events for the L1 cache. Maybe I missed something? My question is: which Event Number and UMASK value should I use to count L1 cache hit events? Clarifications: 1) The final goal I want to achieve
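Whichever event turns out to be the right one, an Event Number and UMASK pair is usually handed to perf as a raw code with the event select in bits 0-7 and the unit mask in bits 8-15. The sketch below composes that code. The example pair 0xD1/0x01 is believed to be MEM_LOAD_UOPS_RETIRED.L1_HIT on Haswell, but that is an assumption to be verified against the Intel event tables for the target CPU. The function names are hypothetical.

```python
def raw_event_config(event_select, umask):
    """Pack an event number and unit mask into a raw config value."""
    if not (0 <= event_select <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event_select and umask must each fit in 8 bits")
    return (umask << 8) | event_select


def perf_raw_name(event_select, umask):
    """Format the pair as a perf raw event name, e.g. 'r01d1'."""
    return f"r{raw_event_config(event_select, umask):04x}"


if __name__ == "__main__":
    # Assumed encoding for an L1 hit event on Haswell; verify before use.
    print(perf_raw_name(0xD1, 0x01))
```

The resulting name can be passed to perf stat with the -e option.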

Is it possible for the RESOURCE_STALLS.RS event to occur even when the RS is not completely full?

The description of the RESOURCE_STALLS.RS hardware performance event for Intel Broadwell is the following: "This event counts stall cycles caused by absence of eligible entries in the reservation station (RS). This may result from RS overflow, or from RS deallocation because of the RS array Write Port allocation scheme (each RS entry has two write ports instead of four. As a result, empty entries could not be used, although RS is not really full). This counts cycles that the pipeline backend blocked uop delivery from the front end." This basically says that there are two situations where the RS

Reliability of Xcode Instrument's disassembly time profiling

Question: I've profiled my code using the Instruments Time Profiler, and zooming in to the disassembly, here's a snippet of its results: I wouldn't expect a mov instruction to take 23.3% of the time while a div instruction takes virtually nothing. This leads me to believe these results are unreliable. Is this true and known? Or am I just experiencing an Instruments bug? Or is there some option I need to use to obtain reliable results? Is there any reference expanding on this issue? Answer 1: First of all,