intel-pmu | 易学教程

Perf stat equivalent for Mac OS?

Question: Is there a perf stat equivalent on Mac OS? I would like to do the same thing for a CLI command, and googling is not yielding anything.

Answer 1: There was the Instruments tool in Mac OS X for profiling applications, including with the hardware PMU. By default it runs a sampling profiler for CPU usage. Some docs:
https://en.wikipedia.org/wiki/Instruments_(software)
https://help.apple.com/instruments/mac/current/
It also has a command-line variant:
https://help.apple.com/instruments/mac/current/#/devb14ffaa5
Open

PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE concurrent monitoring

Question: I'm working on a custom implementation on top of the perf_event_open syscall. The implementation aims to support various PERF_TYPE_HARDWARE, PERF_TYPE_SOFTWARE and PERF_TYPE_HW_CACHE events for specific threads on any core. In the Intel® 64 and IA-32 Architectures Software Developer’s Manual, vol. 3B, I see the following for my test CPU (Kaby Lake): To my understanding so far, one can monitor (theoretically) unlimited PERF_TYPE_SOFTWARE events concurrently but limited (without multiplexing) PERF

Perf tool stat output: multiplex and scaling of “cycles”

Question: I am trying to understand the multiplexing and scaling of the "cycles" event in the "perf" output. The following is the output of the perf tool:

     144094.487583  task-clock (msec)   #    1.017 CPUs utilized
      539912613776  instructions        #    1.09  insn per cycle           (83.42%)
      496622866196  cycles              #    3.447 GHz                      (83.48%)
         340952514  cache-misses        #   10.354 % of all cache refs      (83.32%)
        3292972064  cache-references    #   22.854 M/sec                    (83.26%)
     144081.898558  cpu-clock (msec)    #    1.017 CPUs utilized
           4189372  page-faults         #    0.029 M/sec
                 0  major
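The percentages in parentheses show how much of the measurement window each event was actually scheduled on a hardware counter. When events are multiplexed, perf estimates the full count by scaling the raw count by time_enabled / time_running. The following is a minimal sketch of that arithmetic, not perf's own code; the function names and the sample values are hypothetical and are not taken from the output above.

```python
def scale_count(raw_count: int, time_enabled_ns: int, time_running_ns: int) -> float:
    """Estimate the full-interval count of a multiplexed event.

    raw_count       -- value read from the counter (hypothetical input)
    time_enabled_ns -- how long the event was enabled
    time_running_ns -- how long it was actually on a hardware counter
    """
    if time_running_ns == 0:
        # The event was never scheduled, so there is nothing to extrapolate.
        return 0.0
    return raw_count * time_enabled_ns / time_running_ns


def running_percent(time_enabled_ns: int, time_running_ns: int) -> float:
    """Share of the enabled time the event was counting, as a percentage."""
    return 100.0 * time_running_ns / time_enabled_ns


# Made-up values for illustration only.
raw = 1_000_000
enabled_ns = 10_000_000
running_ns = 8_000_000

estimate = scale_count(raw, enabled_ns, running_ns)
percent = running_percent(enabled_ns, running_ns)
print(f"estimated count: {estimate:.0f} ({percent:.2f}%)")
```

The scaled value is an extrapolation: it assumes the event rate during the unobserved time matched the rate during the observed time, so it can be off for workloads with distinct phases.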

Why does Linux perf use event l1d.replacement for “L1 dcache misses” on x86?

Question: On Intel x86, Linux uses the event l1d.replacement to implement its L1-dcache-load-misses event. This event is defined as follows:

    Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.

Perhaps naively, I would have expected perf to use something like mem_load_retired.l1_miss, which supports PEBS and is defined as:

    Counts retired load instructions with at least one uop that missed in the L1 cache.

Intel PMU event for L1 cache hit event

Question: I'm trying to count the number of cache hits at different cache levels (L1, L2 and L3) for a program on an Intel Haswell processor. I wrote a program to count the number of L2 and L3 cache hits by monitoring the respective events. To achieve that, I checked the Intel x86 Software Developer's Manual and used the cache_all_request event and cache_miss event for the L2 and L3 caches. However, I didn't find the events for the L1 cache. Maybe I missed something? My questions are: Which Event Number and UMASK

Is it possible for the RESOURCE_STALLS.RS event to occur even when the RS is not completely full?

Question: The description of the RESOURCE_STALLS.RS hardware performance event for Intel Broadwell is the following:

    This event counts stall cycles caused by absence of eligible entries in the reservation station (RS). This may result from RS overflow, or from RS deallocation because of the RS array Write Port allocation scheme (each RS entry has two write ports instead of four. As a result, empty entries could not be used, although RS is not really full). This counts cycles that the pipeline backend

Why does the number of uops per iteration increase with the stride of streaming loads?

Question: Consider the following loop:

    .loop:
        add rsi, OFFSET
        mov eax, dword [rsi]
        dec ebp
        jg .loop

where OFFSET is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code. That is, the buffer is not initialized or touched before the loop. Presumably, on Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. Therefore, the only limit on the buffer size is the number of virtual pages.

Can the Intel performance monitor counters be used to measure memory bandwidth?

Question: Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means DRAM (i.e., accesses that do not hit in any cache level).

Answer 1: Yes(ish), indirectly. You can use the relationship between counters (including the time stamp) to infer other numbers. For example, if you sample a 1-second interval and there are N last-level (L3) cache misses, you can be pretty confident you are occupying N*CacheLineSize bytes per second. It gets a bit stickier to relate it accurately to
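The N*CacheLineSize estimate from the answer can be sketched as a small calculation. This is only an illustration under stated assumptions: the 64-byte line size is an assumption (it is the usual value on Intel x86 parts), the function name and the miss count are hypothetical, and the estimate ignores traffic that does not show up as demand misses, such as hardware prefetches and write-backs.

```python
CACHE_LINE_BYTES = 64  # assumed line size; check the value for your CPU


def estimate_dram_bandwidth(llc_misses: int, interval_s: float,
                            line_bytes: int = CACHE_LINE_BYTES) -> float:
    """Rough DRAM read bandwidth in bytes per second.

    llc_misses -- last-level cache misses counted during the interval
    interval_s -- length of the sampling interval in seconds
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Each miss is assumed to bring in exactly one cache line from DRAM.
    return llc_misses * line_bytes / interval_s


# Made-up sample: 50 million misses observed over one second.
bandwidth = estimate_dram_bandwidth(50_000_000, 1.0)
print(f"{bandwidth / 1e9:.2f} GB/s")
```

Treat the result as a lower-bound style estimate rather than a measurement; the counters that feed it differ between microarchitectures.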