Question
I am studying cache effects using a simple micro-benchmark.
My understanding is that if N is larger than the cache size, the cache should miss once on the first read of every cache line.
On my machine the cache line size is 64 bytes, so I expect the loop to incur roughly N/8 misses in total, and cachegrind indeed shows this.
However, the perf tool reports a very different result: only 34,256 cache-miss events.
I suspected the hardware prefetcher, so I disabled it in the BIOS, but the result is the same.
I really don't understand why perf reports so many fewer cache misses than cachegrind.
Could someone give me a reasonable explanation?
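To make the expectation concrete: the array holds N = 10,000,000 doubles x 8 bytes = 80 MB, far larger than any cache level, and one 64-byte line holds 8 doubles, so the first read of each line should miss, giving N/8 = 1,250,000 misses. That is exactly the D1mr count in the cachegrind output below.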
1. Here is a simple micro-benchmark program.
#include <stdio.h>

#define N 10000000

double A[N];

int main() {
    int i;
    double temp = 0.0;
    for (i = 0; i < N; i++) {
        temp = A[i] * A[i];
    }
    return 0;
}
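(The build command isn't shown; judging from the annotated test.s and the editor's notes below, it was presumably compiled without optimizations, along the lines of:

gcc -O0 -S test.c     # emit the test.s that is annotated below (assumption)
gcc -o test test.s

This matters because at -O0 the loop is not optimized away, which is what keeps the instruction and access counts below meaningful.)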
2. The following is cachegrind's output:
==27612== Cachegrind, a cache and branch-prediction profiler
==27612== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==27612== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==27612== Command: ./test
==27612==
--27612-- warning: L3 cache found, using its data for the LL simulation.
==27612==
==27612== I refs: 110,102,998
==27612== I1 misses: 728
==27612== LLi misses: 720
==27612== I1 miss rate: 0.00%
==27612== LLi miss rate: 0.00%
==27612==
==27612== D refs: 70,038,455 (60,026,965 rd + 10,011,490 wr)
==27612== D1 misses: 1,251,802 ( 1,251,288 rd + 514 wr)
==27612== LLd misses: 1,251,624 ( 1,251,137 rd + 487 wr)
==27612== D1 miss rate: 1.7% ( 2.0% + 0.0% )
==27612== LLd miss rate: 1.7% ( 2.0% + 0.0% )
==27612==
==27612== LL refs: 1,252,530 ( 1,252,016 rd + 514 wr)
==27612== LL misses: 1,252,344 ( 1,251,857 rd + 487 wr)
==27612== LL miss rate: 0.6% ( 0.7% + 0.0% )
Generated report file:
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 4-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 8388608 B, 64 B, 16-way associative
Command: ./test
Data file: cache_block
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated: /home/jin/1_dev/99_test/OI/test.s
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
110,102,998 728 720 60,026,965 1,251,288 1,251,137 10,011,490 514 487 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
110,000,011 1 1 60,000,003 1,250,000 1,250,000 10,000,003 0 0 /home/jin/1_dev/99_test/OI/test.s:main
--------------------------------------------------------------------------------
-- User-annotated source: /home/jin/1_dev/99_test/OI/test.s
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
-- line 2 ----------------------------------------
. . . . . . . . . .comm A,80000000,32
. . . . . . . . . .comm B,80000000,32
. . . . . . . . . .text
. . . . . . . . . .globl main
. . . . . . . . . .type main, @function
. . . . . . . . . main:
. . . . . . . . . .LFB0:
. . . . . . . . . .cfi_startproc
1 0 0 0 0 0 1 0 0 pushq %rbp
. . . . . . . . . .cfi_def_cfa_offset 16
. . . . . . . . . .cfi_offset 6, -16
1 0 0 0 0 0 0 0 0 movq %rsp, %rbp
. . . . . . . . . .cfi_def_cfa_register 6
1 0 0 0 0 0 0 0 0 movl $0, %eax
1 1 1 0 0 0 1 0 0 movq %rax, -16(%rbp)
1 0 0 0 0 0 1 0 0 movl $0, -4(%rbp)
1 0 0 0 0 0 0 0 0 jmp .L2
. . . . . . . . . .L3:
10,000,000 0 0 10,000,000 0 0 0 0 0 movl -4(%rbp), %eax
10,000,000 0 0 0 0 0 0 0 0 cltq
10,000,000 0 0 10,000,000 1,250,000 1,250,000 0 0 0 movsd A(,%rax,8), %xmm1
10,000,000 0 0 10,000,000 0 0 0 0 0 movl -4(%rbp), %eax
10,000,000 0 0 0 0 0 0 0 0 cltq
10,000,000 0 0 10,000,000 0 0 0 0 0 movsd A(,%rax,8), %xmm0
10,000,000 0 0 0 0 0 0 0 0 mulsd %xmm1, %xmm0
10,000,000 0 0 0 0 0 10,000,000 0 0 movsd %xmm0, -16(%rbp)
10,000,000 0 0 10,000,000 0 0 0 0 0 addl $1, -4(%rbp)
. . . . . . . . . .L2:
10,000,001 0 0 10,000,001 0 0 0 0 0 cmpl $9999999, -4(%rbp)
10,000,001 0 0 0 0 0 0 0 0 jle .L3
1 0 0 0 0 0 0 0 0 movl $0, %eax
1 0 0 1 0 0 0 0 0 popq %rbp
. . . . . . . . . .cfi_def_cfa 7, 8
1 0 0 1 0 0 0 0 0 ret
. . . . . . . . . .cfi_endproc
. . . . . . . . . .LFE0:
. . . . . . . . . .size main, .-main
. . . . . . . . . .ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
. . . . . . . . . .section .note.GNU-stack,"",@progbits
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
100 0 0 100 100 100 100 0 0 percentage of events annotated
3. The following is perf's output:
$ sudo perf stat -r 10 -e instructions -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e LLC-loads -e LLC-load-misses -e LLC-prefetches ./test
Performance counter stats for './test' (10 runs):
113,898,951 instructions # 0.00 insns per cycle ( +- 12.73% ) [17.36%]
53,607 cache-references ( +- 12.92% ) [29.23%]
1,483 cache-misses # 2.767 % of all cache refs ( +- 26.66% ) [39.84%]
48,612,823 L1-dcache-loads ( +- 4.58% ) [50.45%]
34,256 L1-dcache-load-misses # 0.07% of all L1-dcache hits ( +- 18.94% ) [54.38%]
14,992,686 L1-dcache-stores ( +- 4.90% ) [52.58%]
1,980 L1-dcache-store-misses ( +- 6.36% ) [61.83%]
1,154 LLC-loads ( +- 61.14% ) [53.22%]
18 LLC-load-misses # 1.60% of all LL-cache hits ( +- 16.26% ) [10.87%]
0 LLC-prefetches [ 0.00%]
0.037949840 seconds time elapsed ( +- 3.57% )
More experimental results (2014.05.13):
jin@desktop:~/1_dev/99_test/OI$ sudo perf stat -r 10 -e instructions -e r53024e -e r53014e -e L1-dcache-loads -e L1-dcache-load-misses -e r500f0a -e r500109 ./test
Performance counter stats for './test' (10 runs):
116,464,390 instructions # 0.00 insns per cycle ( +- 2.67% ) [67.43%]
5,994 r53024e <-- L1D hardware prefetch misses ( +- 21.74% ) [70.92%]
1,387,214 r53014e <-- L1D hardware prefetch requests ( +- 2.37% ) [75.61%]
61,667,802 L1-dcache-loads ( +- 1.27% ) [78.12%]
26,297 L1-dcache-load-misses # 0.04% of all L1-dcache hits ( +- 48.92% ) [43.24%]
0 r500f0a <-- LLC lines allocated [56.71%]
41,545 r500109 <-- Number of LLC read misses ( +- 6.16% ) [50.08%]
0.037080925 seconds time elapsed
In the result above, the number of "L1D hardware prefetch requests" (1,387,214) is close to the D1 miss count (1,250,000) that cachegrind reported.
My conclusion: when memory is accessed in a streaming pattern, the L1D prefetcher is triggered, and because the LLC miss counters then reflect prefetches rather than demand loads, I cannot tell how many bytes were actually loaded from memory.
Is my conclusion correct?
Editor's notes:
(1) Judging by the cachegrind output, the OP was most probably using gcc 4.6.3 with no optimizations.
(2) Some of the raw events used in perf stat are only officially supported on Nehalem/Westmere, so that is probably the microarchitecture the OP is using.
(3) The bits set in the most significant byte (i.e., the third byte) of the raw event codes are ignored by perf (although not all bits of the third byte are ignored), so the events are effectively r024e, r014e, r0f0a, and r0109.
(4) The events r0f0a and r0109 are uncore events, but the OP has specified them as core events, which is wrong because perf will measure them as core events.
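(5) As background on decoding those raw events: on x86, the low byte of rUUEE is the event-select code and the second byte is the umask, so the two prefetch events decode as follows (meanings taken from the OP's own annotations above):

r014e -> event 0x4E, umask 0x01: L1D hardware prefetch requests
r024e -> event 0x4E, umask 0x02: L1D hardware prefetch misses

An equivalent, cleaner invocation would therefore presumably be:

sudo perf stat -e r014e -e r024e ./test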
Answer 1:
Bottom line: your assumption regarding prefetches is correct, but your workaround isn't.
First, as Carlo pointed out, this loop would usually get optimized out entirely by any compiler. Since both perf and cachegrind show that roughly 100M instructions retire, I guess you didn't compile with optimizations, which means the behavior isn't very realistic: for example, your loop variable may be stored in memory instead of in a register, adding pointless memory accesses and skewing the cache counters.
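As an aside, a minimal way to keep the loop alive under optimization, in case you want to rerun the experiment with -O2 (the accumulation and the printf are my additions, not part of the original benchmark):

#include <stdio.h>

#define N 10000000

double A[N];

int main(void) {
    double temp = 0.0;
    int i;
    for (i = 0; i < N; i++) {
        temp += A[i] * A[i];  /* accumulate so the loads feed a live value */
    }
    printf("%f\n", temp);     /* use the result so the loop survives -O2 */
    return 0;
}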
Now, the difference between your runs is that cachegrind is just a cache simulator; it doesn't simulate prefetches, so every first access to a line misses, as expected. The real CPU, on the other hand, does have hardware prefetchers, as you can see: thanks to the simple streaming pattern, the first access to each line is satisfied by a prefetch rather than by an actual demand load. This is why perf's normal counters don't count these accesses as misses.
You can see that when you enable the prefetch counter, you get roughly the expected N/8 prefetches: 1,387,214 requests versus the predicted 1,250,000 (plus some additional ones, probably from other types of accesses).
Disabling the prefetcher would seem like the right thing to do; however, most CPUs don't offer much control over it. You didn't specify which processor you're using, but on Intel, for example, you can see in the following article that only the L2 prefetchers are controlled by the BIOS, while your output shows L1 prefetches: https://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers
Search the manuals for your CPU type to see which L1 prefetchers exist and how to work around them. Usually a simple stride larger than a single cache line is enough to trick them, but if that doesn't work, you'll need to make your access pattern more random, for example by walking the array in a random permutation of the indices, as in the sketch below.
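A minimal sketch of that permutation idea (the perm array, the Fisher-Yates shuffle, and the use of rand() are illustrative choices on my part, not the only way to do it):

#include <stdio.h>
#include <stdlib.h>

#define N 10000000

double A[N];
static int perm[N];  /* random visiting order for the indices */

int main(void) {
    double temp = 0.0;
    int i;

    for (i = 0; i < N; i++)
        perm[i] = i;
    /* Fisher-Yates shuffle: destroys the streaming pattern that the
       hardware prefetcher keys on; assumes RAND_MAX >= N (true on glibc). */
    for (i = N - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }

    for (i = 0; i < N; i++)
        temp += A[perm[i]] * A[perm[i]];

    printf("%f\n", temp);  /* keep the loop from being optimized away */
    return 0;
}

With a walk like this, the demand-miss counters in perf should move back toward the ~N/8 figure that cachegrind predicts.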
Source: https://stackoverflow.com/questions/23605341/i-dont-understand-cache-miss-count-between-cachegrind-vs-perf-tool