问题
I'm trying to come up with an example program which would have a high cache-miss rate. I thought I could try accessing a matrix column by column like so:
#include <stdlib.h>
int main(void)
{
int i, j, k;
int w = 1000;
int h = 1000;
int **block = malloc(w * sizeof(int*));
for (i = 0; i < w; i++) {
block[i] = malloc(h * sizeof(int));
}
for (k = 0; k < 10; k++) {
for (i = 0; i < w; i++) {
for (j = 0; j < h; j++) {
block[j][i] = 0;
}
}
}
return 0;
}
when I compile this with -O0
flag and run using perf stat -r 5 -B -e cache-references,cache-misses ./a.out
it gives me:
Performance counter stats for './a.out' (5 runs):
715,463 cache-references ( +- 0.42% )
527,634 cache-misses # 73.747 % of all cache refs ( +- 2.53% )
0.112001160 seconds time elapsed ( +- 1.58% )
which is good enough for my purposes. However if I go ahead and change the matrix size to 2000x2000
it gives:
Performance counter stats for './a.out' (5 runs):
6,364,995 cache-references ( +- 2.32% )
2,534,989 cache-misses # 39.827 % of all cache refs ( +- 0.02% )
0.461104903 seconds time elapsed ( +- 0.92% )
and if I increase it even further to 3000x3000
I get:
Performance counter stats for './a.out' (5 runs):
59,204,028 cache-references ( +- 1.36% )
5,662,629 cache-misses # 9.565 % of all cache refs ( +- 0.11% )
1.116573625 seconds time elapsed ( +- 0.32% )
which is strange because I would expect to get more cache miss rate as the size increases. I need something that will be as platform independent as possible. computer architecture class was long ago so any insight would be welcomed..
Notes
I said I need something relatively platform independent but still these are my specs:
- Intel® Core™ i5-2467M
- 4 GiB RAM
- 64 bit ubuntu 12.04
回答1:
Beware of automatic prefetch in modern CPUs - it can often detect strided accesses. Perhaps try a random access pattern, e.g.:
int main(void)
{
int i;
int n = 1000 * 1000;
int *block = malloc(n * sizeof(int));
for (i = 0; i < n / 10; i++) {
int ri = rand() % n;
block[ri] = 0;
}
return 0;
}
回答2:
I'm not completely certain you can compare these programs or really guarantee anything, because it depends on how the OS is allocating individual chunks of memory.
You should at least allocate ALL memory as a single block, then index into that block to get all the arrays (int*
and int
). That way you have a consistent starting point. You may want to pass the array size as an argument instead of recompiling each time.
You can also tweak it so that you allocate WAY more memory than you need and put each row (or column, the way you have written it), to guarantee that only one row (column) of the matrix will be loaded in cache at any one time. ie find out the size of your cache, and space each chunk at least that many bytes apart.
Note that you should really free
your memory before exiting.
As already pointed out by others, randomizing your access pattern is a good idea.
来源:https://stackoverflow.com/questions/14635562/how-to-come-up-with-a-high-cache-miss-rate-example