In Ulrich Drepper's paper *What Every Programmer Should Know About Memory*, part 3 (CPU Caches), he shows a graph of the relationship between "working set" size and the CPU cycles spent per operation (here, sequential reading).
With gcc-4.7, compiling with `gcc -std=c99 -O2 -S -D_GNU_SOURCE -fverbose-asm tcache.c`, you can see that the compiler optimizes the `for` loop away entirely (because `sum` is not used).
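Here is a hypothetical reduction of that pattern (not your exact code), just to show why the loop can disappear:

```c
#include <stddef.h>

/* Since `sum` is never used after the loop, gcc -O2 is free to
   delete the whole loop as dead code, and the "benchmark" then
   measures nothing at all. */
void sum_array(const int *arr, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += arr[i];      /* result discarded: everything above vanishes */
    (void)sum;              /* only silences the unused-variable warning */
}
```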
I had to fix your source code first: some `#include`s are missing, and `i` is not declared in the second function, so your example doesn't even compile as it is.
Make `sum` a global variable, or pass it somehow to the caller (perhaps through a global `int globalsum;`, putting `globalsum = sum;` after the loop).
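A minimal sketch of that fix (`globalsum` is just a name I chose; the rest of your benchmark stays the same):

```c
#include <stddef.h>

int globalsum;               /* global sink: the result is now observable */

void sum_array(const int *arr, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += arr[i];
    globalsum = sum;         /* this store forces gcc to keep the loop */
}
```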
And I am not sure you are right to clear the array with `memset`: I could imagine a clever-enough compiler understanding that you are summing all zeros.
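One way around that (a sketch, not the only option) is to fill the array with values the compiler cannot fold at compile time, for example derived from a runtime argument:

```c
#include <stddef.h>

/* Fill the array with data that depends on a runtime value (the
   `seed` could come from argc, say), so the compiler cannot prove
   at compile time that the sum is a constant. */
void fill_array(int *arr, size_t n, int seed)
{
    for (size_t i = 0; i < n; i++)
        arr[i] = seed + (int)i;
}
```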
Finally, your code has extremely regular behavior with good locality: once in a while a cache miss happens, the entire cache line is loaded, and the data is good for many iterations afterwards. Some clever optimizations (e.g. `-O3` or better) might even generate the right `prefetch` instructions. This is nearly optimal for caches: with a 32-word L1 cache line, a miss happens only once every 32 iterations, so its cost is well amortized.
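To make the amortization concrete, here is a back-of-the-envelope model; the 1-cycle hit and 200-cycle miss figures are my assumptions, not measurements:

```c
#include <stdio.h>

int main(void)
{
    const double words_per_line = 32.0;   /* cache-line size used above   */
    const double hit_cycles     = 1.0;    /* assumed: ~1 cycle per L1 hit */
    const double miss_cycles    = 200.0;  /* assumed: ~200-cycle RAM miss */

    /* One access in every 32 pays the miss penalty; the rest hit. */
    double avg = hit_cycles + miss_cycles / words_per_line;
    printf("average cost per access: %.2f cycles\n", avg);   /* 7.25 */
    return 0;
}
```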
Making a linked list of your data would make cache behavior much worse. Conversely, in some real programs, carefully adding a `__builtin_prefetch` at a few well-chosen places may improve performance by more than 10% (but adding too many of them will decrease performance).
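For illustration, a minimal sketch of such a loop (GCC-specific; the prefetch distance of 16 elements is a hypothetical value that must be tuned by measurement):

```c
#include <stddef.h>

/* Prefetch a cache line we will need soon while summing the current
   one. Too small a distance hides nothing; too large evicts data
   before it is used. */
int sum_with_prefetch(const int *arr, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&arr[i + 16], 0, 1); /* 0 = read access */
        sum += arr[i];
    }
    return sum;
}
```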
In real life, the processor spends the majority of its time waiting for some cache (and that is difficult to measure; this waiting is CPU time, not idle time). Remember that during an L3 cache miss, the time needed to load data from your RAM module is enough to execute hundreds of machine instructions!