intel-pmu

Why are the user-mode L1 store miss events only counted when there is a store initialization loop?

时光总嘲笑我的痴心妄想 submitted on 2019-12-01 07:39:38
Summary

Consider the following loop:

    loop:
        movl $0x1,(%rax)
        add $0x40,%rax
        cmp %rdx,%rax
        jne loop

where rax is initialized to the address of a buffer that is larger than the L3 cache size. Every iteration performs a store operation to the next cache line. I expect the number of RFO requests sent from the L1D to the L2 to be more or less equal to the number of cache lines accessed. The problem is that this seems to be the case only when I count kernel-mode events, even though the program runs in user mode, except in one case that I discuss below. The way the buffer is allocated does not seem
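As a rough C counterpart of the store loop in this excerpt, the sketch below writes one dword per cache line and returns the number of lines written, which is the figure the question expects the RFO count to approach. The helper name `touch_lines` and the 64-byte line size are assumptions made for illustration (the line size is inferred from the `0x40` stride); unlike the assembly loop, this version does nothing when the buffer is empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* assumption: 64-byte lines, matching the 0x40 stride */

/* Hypothetical helper, not from the question: writes 1 to the first
   dword of every cache line in buf and returns how many lines it wrote.
   bytes must be a multiple of the cache line size. */
size_t touch_lines(volatile uint32_t *buf, size_t bytes)
{
    assert(bytes % CACHE_LINE_BYTES == 0);
    const size_t dwords_per_line = CACHE_LINE_BYTES / sizeof(uint32_t);
    size_t lines = 0;
    for (size_t i = 0; i < bytes / sizeof(uint32_t); i += dwords_per_line) {
        buf[i] = 1;   /* counterpart of: movl $0x1,(%rax) */
        lines++;
    }
    return lines;
}
```

For a 64 MiB buffer the function would return 1,048,576, so that is the order of magnitude a per-line RFO count would be compared against.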

Why are the user-mode L1 store miss events only counted when there is a store initialization loop?

蓝咒 submitted on 2019-12-01 04:19:52
Question

Summary

Consider the following loop:

    loop:
        movl $0x1,(%rax)
        add $0x40,%rax
        cmp %rdx,%rax
        jne loop

where rax is initialized to the address of a buffer that is larger than the L3 cache size. Every iteration performs a store operation to the next cache line. I expect the number of RFO requests sent from the L1D to the L2 to be more or less equal to the number of cache lines accessed. The problem is that this seems to be the case only when I count kernel-mode events, even though the program

Can the LSD issue uOPs from the next iteration of the detected loop?

空扰寡人 submitted on 2019-11-30 18:50:14
I was investigating the capabilities of the branch unit on port 0 of my Haswell, starting with a very simple loop:

    BITS 64
    GLOBAL _start
    SECTION .text

    _start:
        mov ecx, 10000000
    .loop:
        dec ecx         ;|
        jz .end         ;| 1 uOP (call it D)
        jmp .loop       ;| 1 uOP (call it J)
    .end:
        mov eax, 60
        xor edi, edi
        syscall

Using perf we see that the loop runs at 1c/iter:

    Performance counter stats for './main' (50 runs):

        10,001,055  uops_executed_port_port_6  ( +- 0.00% )
         9,999,973  uops_executed_port_port_0  ( +- 0.00% )
        10,015,414  cycles:u                   ( +- 0.02% )
                23  resource_stalls_rs         ( +- 64.05% )

My interpretations of these

Why does the number of uops per iteration increase with the stride of streaming loads?

点点圈 submitted on 2019-11-30 13:42:12
Consider the following loop:

    .loop:
        add rsi, OFFSET
        mov eax, dword [rsi]
        dec ebp
        jg .loop

where OFFSET is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code. That is, the buffer is not initialized or touched before the loop. Presumably, on Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. Therefore, the only limit on the buffer size is the number of virtual pages, so we can easily experiment with very large buffers. The loop consists of 4 instructions. Each
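To make the relation between the stride and the number of 4K virtual pages concrete, here is a small C model of the address stream produced by the loop in this excerpt. It only counts distinct pages; it does not model uops, page faults, or the hardware. The function name `pages_touched` is hypothetical, and the model assumes 4 KiB pages (as the excerpt states), a buffer that starts on a page boundary, and an OFFSET that is a multiple of 4 so that no load straddles two pages.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KiB pages, as stated in the question */

/* Hypothetical model, not from the question: replays the addresses the
   loop would load from and counts the distinct 4 KiB pages they fall in.
   Addresses never decrease, so a change of page number means a new page. */
uint64_t pages_touched(uint64_t iters, uint64_t offset)
{
    assert(offset % 4 == 0);  /* keeps each 4-byte load inside one page */
    uint64_t pages = 0, last_page = UINT64_MAX, addr = 0;
    for (uint64_t i = 0; i < iters; i++) {
        addr += offset;                     /* add rsi, OFFSET */
        uint64_t page = addr / PAGE_BYTES;  /* page holding the load */
        if (page != last_page) {
            pages++;
            last_page = page;
        }
    }
    return pages;
}
```

With a stride of 64 bytes, 1024 iterations span 17 pages, while a stride of 4096 bytes touches a new page on every iteration, which is why large strides need the very large buffers the excerpt mentions.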

Can the LSD issue uOPs from the next iteration of the detected loop?

允我心安 submitted on 2019-11-30 03:23:16
Question

I was investigating the capabilities of the branch unit on port 0 of my Haswell, starting with a very simple loop:

    BITS 64
    GLOBAL _start
    SECTION .text

    _start:
        mov ecx, 10000000
    .loop:
        dec ecx         ;|
        jz .end         ;| 1 uOP (call it D)
        jmp .loop       ;| 1 uOP (call it J)
    .end:
        mov eax, 60
        xor edi, edi
        syscall

Using perf we see that the loop runs at 1c/iter:

    Performance counter stats for './main' (50 runs):

        10,001,055  uops_executed_port_port_6  ( +- 0.00% )
         9,999,973  uops_executed_port_port_0  ( +- 0.00% )

Hardware cache events and perf

♀尐吖头ヾ submitted on 2019-11-26 23:24:28
Question

When I run perf list I see a bunch of Hardware Cache Events, as follows:

    $ perf list | grep 'cache event'
      L1-dcache-load-misses    [Hardware cache event]
      L1-dcache-loads          [Hardware cache event]
      L1-dcache-stores         [Hardware cache event]
      L1-icache-load-misses    [Hardware cache event]
      LLC-load-misses          [Hardware cache event]
      LLC-loads                [Hardware cache event]
      LLC-store-misses         [Hardware cache event]
      LLC-stores               [Hardware cache event]
      branch-load-misses       [Hardware cache event]
      branch-loads             [Hardware cache
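As background for the names in this listing: at the `perf_event_open` level, a generic hardware cache event uses type `PERF_TYPE_HW_CACHE` and a `config` value that packs a cache id, an operation id, and a result id, as described in the `perf_event_open(2)` man page. The sketch below shows only that packing; `hw_cache_config` is a hypothetical helper name, and how each generic event maps onto a model-specific PMU event is decided by the kernel and is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Hypothetical helper: builds the config value for a PERF_TYPE_HW_CACHE
   event as cache_id | (op_id << 8) | (result_id << 16). */
uint64_t hw_cache_config(unsigned cache, unsigned op, unsigned result)
{
    assert(cache < PERF_COUNT_HW_CACHE_MAX);
    assert(op < PERF_COUNT_HW_CACHE_OP_MAX);
    assert(result < PERF_COUNT_HW_CACHE_RESULT_MAX);
    return (uint64_t)cache | ((uint64_t)op << 8) | ((uint64_t)result << 16);
}
```

Under this encoding, an event such as `L1-dcache-load-misses` would correspond to `hw_cache_config(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_RESULT_MISS)`, and the result would be placed in the `config` field of a `struct perf_event_attr` whose `type` is `PERF_TYPE_HW_CACHE`.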