intel-pmu

Using PEBS and Linux Perf to Count the number of CPU cycles passed to execute X number of instructions

送分小仙女 posted on 2019-12-11 04:27:00
Question: I want to do something like this: after 100 million instructions have passed, query the Linux perf hardware CPU-cycles counter and record the value in a file. I want to use this code to characterize the performance of applications and benchmark programs during different phases of program execution. I have a hint that I need to set up Intel PEBS so that it overflows after 100 million instructions have passed, and then query the Linux perf hardware CPU-cycles counter. Any pointer on where to start and if someone has already
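
Once such readings exist, the per-phase cost is the difference between consecutive cumulative cycle counts. The following is a minimal post-processing sketch, not the collection mechanism itself. It assumes a hypothetical list of cumulative cycle readings, one taken at program start and one at each 100-million-instruction overflow; the function names and the sample numbers are illustrative only.

```python
from typing import List

# Window size taken from the question: one reading every 100 million instructions.
WINDOW_INSTRUCTIONS = 100_000_000


def cycles_per_window(cumulative_cycles: List[int]) -> List[int]:
    """Return the cycles spent in each window.

    cumulative_cycles holds the running cycle count at program start and at
    each instruction-counter overflow (hypothetical input format).
    """
    return [
        later - earlier
        for earlier, later in zip(cumulative_cycles, cumulative_cycles[1:])
    ]


def cpi_per_window(
    cumulative_cycles: List[int], window: int = WINDOW_INSTRUCTIONS
) -> List[float]:
    """Return cycles per instruction for each window."""
    return [cycles / window for cycles in cycles_per_window(cumulative_cycles)]


if __name__ == "__main__":
    # Hypothetical readings, invented for illustration.
    readings = [0, 120_000_000, 410_000_000, 530_000_000]
    print(cycles_per_window(readings))
    print(cpi_per_window(readings))
```

A window whose cycles-per-instruction value differs sharply from its neighbours would mark a change of program phase.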

How does perf use the offcore events?

空扰寡人 posted on 2019-12-11 04:11:30
Question: Some built-in perf events are mapped to offcore events. For example, LLC-loads and LLC-load-misses are mapped to OFFCORE_RESPONSE events. This can be easily determined as discussed here. However, these offcore events require writing certain values to certain MSR registers to actually specify a particular event. perf seems to be using an array called something like snb_hw_cache_extra_regs to specify what values to write to which MSR registers. I would like to know how this array is used.

What is the meaning of IB read, IB write, OB read and OB write? They appear in the output of Intel® PCM while monitoring PCIe bandwidth

早过忘川 posted on 2019-12-11 02:56:42
Question: I am trying to measure the PCIe bandwidth of NIC devices using the Intel® Performance Counter Monitor (PCM) tools, but I am not able to understand the output. To measure the PCIe bandwidth, I executed the binary pcm-iio, which monitors PCIe bandwidth per PCIe device. After executing the binary I got the following output.
|IIO Stack 2 - PCIe1 |IB write|IB read|OB read|OB write|TLB Miss|VT-d L3 Miss|VT-d CTXT Miss|VT-d Lookup|
|_____________________________|________

How can I read performance counters from the kernel?

♀尐吖头ヾ posted on 2019-12-11 02:39:32
Question: I have been using the Linux perf tool in user space. I want to write code that reads the performance counters for a thread every time it is context-switched. The steps required are: 1) Get a mechanism to read the performance counter registers. 2) Call step 1 from the scheduler after every context switch. I am stuck at step 1, as I could not figure out which functions to call for reading the performance registers and how to describe an event while doing it. I tried going through the docs

How to Configure and Sample Intel Performance Counters In-Process

蓝咒 posted on 2019-12-11 00:53:16
Question: In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system):
    results[] = ...
    for (iteration = 0; iteration < num_iterations; iteration++) {
        pctr_start = sample_pctr();
        the_benchmark();
        pctr_stop = sample_pctr();
        results[iteration] = pctr_stop - pctr_start;
    }
FWIW, the performance counter I am thinking of using is CPU_CLK_UNHALTED.THREAD_ALL, to read the number of core cycles independent of clock frequency changes (In

What causes the DTLB_LOAD_MISSES.WALK_* performance events to occur?

喜欢而已 posted on 2019-12-08 00:25:42
Question: Consider the following loop:
    .loop:
        add rsi, STRIDE
        mov eax, dword [rsi]
        dec ebp
        jg .loop
where STRIDE is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code. That is, the buffer is not initialized or touched before the loop. On Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. I've run this code for all possible strides in the range 0-8192. The measured number of

What causes the DTLB_LOAD_MISSES.WALK_* performance events to occur?

徘徊边缘 posted on 2019-12-06 07:12:17
Consider the following loop:
    .loop:
        add rsi, STRIDE
        mov eax, dword [rsi]
        dec ebp
        jg .loop
where STRIDE is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code. That is, the buffer is not initialized or touched before the loop. On Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. I've run this code for all possible strides in the range 0-8192. The measured number of minor and major page faults is exactly 1 and 0, respectively, per page accessed. I've also measured all
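
Since the excerpt reports exactly one minor page fault per page accessed, the expected fault count for a given stride follows from counting the distinct 4K pages the loads touch. The sketch below models that arithmetic only. It assumes the buffer starts page-aligned at offset 0 and that each iteration performs one 4-byte load after advancing the pointer, as in the loop above; `pages_touched` and its parameters are hypothetical names introduced for illustration.

```python
def pages_touched(stride: int, iterations: int,
                  page_size: int = 4096, load_size: int = 4) -> int:
    """Count distinct pages accessed by the strided 4-byte loads.

    Assumes the buffer base is page-aligned (offset 0), which the
    excerpt does not state explicitly.
    """
    pages = set()
    offset = 0
    for _ in range(iterations):
        offset += stride                               # add rsi, STRIDE
        first = offset // page_size                    # page of the first byte loaded
        last = (offset + load_size - 1) // page_size   # page of the last byte loaded
        pages.update(range(first, last + 1))           # a load may straddle two pages
    return len(pages)


if __name__ == "__main__":
    for stride in (0, 64, 4094, 4096, 8192):
        print(stride, pages_touched(stride, iterations=16))
```

Under the stated assumptions, a load that straddles a page boundary (for example at offset 4094) counts toward two pages, which is one way the page count can differ from the iteration count.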

Intel PMU event for L1 cache hit event

生来就可爱ヽ(ⅴ<●) posted on 2019-12-06 04:34:05
I'm trying to count the number of cache hits at different levels of cache (L1, L2 and L3) for a program on an Intel Haswell processor. I wrote a program to count the number of L2 and L3 cache hits by monitoring the respective events. To achieve that, I checked the Intel x86 Software Developer's Manual and used the cache_all_request and cache_miss events for the L2 and L3 caches. However, I didn't find the events for the L1 cache. Maybe I missed something? My questions are: Which Event Number and UMASK value should I use to count the L1 cache hit events? Clarifications: 1) The final goal I want to achieve
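
For context on how an Event Number and UMASK pair is passed to Linux perf: on Intel core PMUs the raw event code places the event number in bits 0-7 and the unit mask in bits 8-15. The sketch below builds the `rNNNN` string that `perf stat -e` accepts. The helper names are hypothetical, and the example pair 0xD1/0x01 (listed in the Intel manual for Haswell as MEM_LOAD_UOPS_RETIRED.L1_HIT) is an assumption to verify against the manual for the exact CPU model; it counts retired load uops that hit L1, which may not match every definition of an "L1 hit".

```python
def raw_config(event: int, umask: int) -> int:
    """Pack an event number and unit mask into a raw perf config value.

    Layout assumed: event in bits 0-7, umask in bits 8-15.
    """
    if not 0 <= event <= 0xFF:
        raise ValueError("event number must fit in 8 bits")
    if not 0 <= umask <= 0xFF:
        raise ValueError("umask must fit in 8 bits")
    return (umask << 8) | event


def perf_raw_event(event: int, umask: int) -> str:
    """Format the pair as a raw event string for the perf command line."""
    return f"r{raw_config(event, umask):04x}"


if __name__ == "__main__":
    # Assumed example pair; check the manual for the target CPU.
    print(perf_raw_event(0xD1, 0x01))
```

The resulting string would be used as, for example, `perf stat -e r01d1 ./program`.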

Is it possible for the RESOURCE_STALLS.RS event to occur even when the RS is not completely full?

邮差的信 posted on 2019-12-05 14:29:57
The description of the RESOURCE_STALLS.RS hardware performance event for Intel Broadwell is the following: This event counts stall cycles caused by absence of eligible entries in the reservation station (RS). This may result from RS overflow, or from RS deallocation because of the RS array Write Port allocation scheme (each RS entry has two write ports instead of four. As a result, empty entries could not be used, although RS is not really full). This counts cycles that the pipeline backend blocked uop delivery from the front end. This basically says that there are two situations where the RS

Reliability of Xcode Instruments' disassembly time profiling

走远了吗. posted on 2019-12-01 16:07:51
Question: I've profiled my code using Instruments' Time Profiler, and zooming in to the disassembly, here's a snippet of its results: I wouldn't expect a mov instruction to take 23.3% of the time while a div instruction takes virtually nothing. This leads me to believe these results are unreliable. Is this true and known? Or am I just experiencing an Instruments bug? Or is there some option I need to use to obtain reliable results? Is there any reference expanding on this issue? Answer 1: First of all,