For an assignment, I need to find the levels (L1, L2, ...) and sizes of each cache.
To answer your question about the weird numbers above 1 MB: it's pretty simple; it comes down to cache eviction and prefetch policies (which interact with branch prediction), and the fact that the L3 cache is shared between the cores.
A Core i3 has a very interesting cache structure; actually, any modern processor does. You should read about them on Wikipedia: there are all sorts of ways for the processor to decide "well, I probably won't need this..." and then put the line in a victim cache, or any number of other things. L1/L2/L3 cache timings can be very complex, depending on a large number of factors and the individual design decisions made.
On top of that, all these decisions and more (see the Wikipedia articles on the subject) have to be synchronized between the two cores' caches. The methods for keeping the shared L3 cache coherent with the separate per-core L1 and L2 caches can be ugly; they can involve back-tracking and redoing calculations, among other things. It's highly unlikely you'll ever have a completely free second core with nothing competing for L3 cache space, so there will always be some synchronization weirdness.
In general, if you are working on data, say convolving a kernel, you want to make sure the working set fits within the L1 cache and shape your algorithm around that (see the blocking sketch below). The L3 cache isn't really meant for working on a data set the way you're doing it (though it is still far better than main memory!)
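For instance, here is a minimal sketch of that blocking idea, assuming a 32 KiB L1 data cache; the size, the BLOCK constant, and process_blocked are all illustrative, not from any real API:

#include <stddef.h>

#define L1_BYTES (32 * 1024)                       /* assumed L1d size; check yours */
#define BLOCK    (L1_BYTES / sizeof(float) / 2)    /* leave headroom for other data */

/* Run several passes over the data one L1-sized block at a time,
   so each block stays cache-hot across all the passes. */
void process_blocked(float *data, size_t n, int passes) {
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t end = (start + BLOCK < n) ? start + BLOCK : n;
        for (int p = 0; p < passes; p++)
            for (size_t i = start; i < end; i++)
                data[i] = data[i] * 0.5f + 1.0f;   /* stand-in for the real kernel */
    }
}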
I swear, if I were the one having to implement cache algorithms, I'd go insane.
For benchmarking with varying strides, you could try lat_mem_rd from the lmbench package; it's open source: http://www.bitmover.com/lmbench/lat_mem_rd.8.html
I posted my Windows port at http://habrahabr.ru/post/111876/ -- it's too lengthy to paste here. That's from two years ago; I haven't tested it with modern CPUs.
After 10 minutes of searching the Intel instruction manual and another 10 minutes of coding, I came up with this (for Intel processors):
#include <stdio.h>
#include <stdint.h>

void i386_cpuid_caches(void) {
    int i;
    for (i = 0; i < 32; i++) {
        // Variables to hold the contents of the 4 i386 legacy registers
        uint32_t eax, ebx, ecx, edx;

        eax = 4; // get cache info
        ecx = i; // cache id

        __asm__ (
            "cpuid" // call i386 cpuid instruction
            : "+a" (eax) // contains the cpuid command code, 4 for cache query
            , "=b" (ebx)
            , "+c" (ecx) // contains the cache id
            , "=d" (edx)
        ); // generates output in 4 registers eax, ebx, ecx and edx

        // See page 3-191 of the manual.
        int cache_type = eax & 0x1F;

        if (cache_type == 0) // end of valid cache identifiers
            break;

        const char *cache_type_string;
        switch (cache_type) {
            case 1: cache_type_string = "Data Cache"; break;
            case 2: cache_type_string = "Instruction Cache"; break;
            case 3: cache_type_string = "Unified Cache"; break;
            default: cache_type_string = "Unknown Type Cache"; break;
        }

        int cache_level = (eax >>= 5) & 0x7;
        int cache_is_self_initializing = (eax >>= 3) & 0x1; // does not need SW initialization
        int cache_is_fully_associative = (eax >>= 1) & 0x1;

        // See page 3-192 of the manual.
        // ebx contains 3 integers of 10, 10 and 12 bits respectively
        unsigned int cache_sets = ecx + 1;
        unsigned int cache_coherency_line_size = (ebx & 0xFFF) + 1;
        unsigned int cache_physical_line_partitions = ((ebx >>= 12) & 0x3FF) + 1;
        unsigned int cache_ways_of_associativity = ((ebx >>= 10) & 0x3FF) + 1;

        // Total cache size is the product of all four quantities
        size_t cache_total_size = cache_ways_of_associativity
                                * cache_physical_line_partitions
                                * cache_coherency_line_size
                                * cache_sets;

        printf(
            "Cache ID %d:\n"
            "- Level: %d\n"
            "- Type: %s\n"
            "- Sets: %u\n"
            "- System Coherency Line Size: %u bytes\n"
            "- Physical Line partitions: %u\n"
            "- Ways of associativity: %u\n"
            "- Total Size: %zu bytes (%zu KiB)\n"
            "- Is fully associative: %s\n"
            "- Is Self Initializing: %s\n"
            "\n"
            , i
            , cache_level
            , cache_type_string
            , cache_sets
            , cache_coherency_line_size
            , cache_physical_line_partitions
            , cache_ways_of_associativity
            , cache_total_size, cache_total_size >> 10
            , cache_is_fully_associative ? "true" : "false"
            , cache_is_self_initializing ? "true" : "false"
        );
    }
}
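A minimal driver (my addition, not part of the original snippet) to run it:

int main(void) {
    i386_cpuid_caches(); // prints one block per cache ID reported by CPUID leaf 4
    return 0;
}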
Reference: Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 2A, page 3-190, CPUID—CPU Identification.
This is much more reliable than measuring cache latencies, as it is pretty much impossible to turn off cache prefetching on a modern processor. If you need similar info for a different processor architecture, you will have to consult that architecture's manual.
For Windows, you can use the GetLogicalProcessorInformation function.
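A minimal sketch of how that might look (my example, with error handling abbreviated; consult MSDN for the full contract):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len);  /* first call just reports the needed size */
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info = malloc(len);
    if (!info || !GetLogicalProcessorInformation(info, &len))
        return 1;
    for (DWORD i = 0; i < len / sizeof(*info); i++) {
        if (info[i].Relationship == RelationCache) {
            CACHE_DESCRIPTOR *c = &info[i].Cache;
            printf("L%d %s cache: %lu bytes, line size %lu\n",
                   (int)c->Level,
                   c->Type == CacheData ? "data"
                     : c->Type == CacheInstruction ? "instruction" : "unified",
                   (unsigned long)c->Size, (unsigned long)c->LineSize);
        }
    }
    free(info);
    return 0;
}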
For Linux, you can use sysconf(). Valid arguments for sysconf can be found in /usr/include/unistd.h or /usr/include/bits/confname.h.
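For example (a sketch; these _SC_LEVEL* constants are glibc extensions and may be missing, or return 0, on other systems):

#include <unistd.h>
#include <stdio.h>

int main(void) {
    printf("L1d size:      %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 size:       %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size:       %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}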
The time it takes just to read the clock (that is, the cost of calling the clock() function itself) is many, many times greater than the time it takes to perform arr[(i*16) & lengthMod]++. This extremely low signal-to-noise ratio (among other likely pitfalls) makes your plan unworkable. A large part of the problem is that you're trying to measure a single iteration of the loop; the sample code you linked attempts to measure a full set of iterations (read the clock before starting the loop, read it again after emerging from the loop, and do not use printf() inside the loop).
If your loop is large enough you might be able to overcome the signal-to-noise ratio problem.
As to "what element is being incremented"; arr
is an address of a 1MB buffer; arr[(i * 16) & lengthMod]++;
causes (i * 16) * lengthMod
to generate an offset from that address; that offset is the address of the int that gets incremented. You're performing a shift (i * 16 will turn into i << 4), a logical and, an addition, then either a read/add/write or a single increment, depending on your CPU).
Edit: As described, your code suffers from a poor SNR (signal-to-noise ratio) due to the relative speeds of memory access (cached or not) and of calling functions just to measure the time. To get the timings you're currently getting, I assume you modified the code to look something like this:
#include <stdio.h>
#include <time.h>

int main(void) {
    int steps = 64 * 1024 * 1024;
    static int arr[1024 * 1024]; // static: a 4 MB array would likely overflow the stack
    int lengthMod = (1024 * 1024) - 1;
    int i;
    double timeTaken;
    clock_t start;

    start = clock();
    for (i = 0; i < steps; i++) {
        arr[(i * 16) & lengthMod]++;
    }
    timeTaken = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("Time for %d: %.12f \n", i, timeTaken);
    return 0;
}
This moves the measurement outside the loop, so you're not measuring a single access (which would really be impossible) but rather steps accesses.

You're free to increase steps as needed, and this will have a direct impact on your timings. Since the times you're receiving are too close together, and in some cases even inverted (your time oscillates between sizes, which is not likely caused by cache), you might try changing the value of steps to 256 * 1024 * 1024 or even larger.

NOTE: You can make steps as large as will fit into a signed int (which should be large enough), since the logical AND ensures that you wrap around in your buffer.
I know this! (In reality it is very complicated because of prefetching.)
for (times = 0; times < Max; times++)          /* many times */
    for (i = 0; i < ArraySize; i = i + Stride)
        dummy = A[i];                          /* touch an item in the array */
Varying the stride (and the array size) lets you probe the properties of the caches: line size, total capacity, and associativity. By looking at a graph of access time against these parameters, you will get your answers. A runnable version of the skeleton follows.
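A self-contained sketch of that skeleton (the array size, pass count, and strides are placeholders to adapt to the machine under test):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_BYTES (8 * 1024 * 1024)   /* larger than the caches under test */

int main(void) {
    size_t n = ARRAY_BYTES;
    volatile char *A = malloc(n);       /* volatile: keep the reads from being optimized away */
    for (size_t i = 0; i < n; i++) A[i] = 1;
    for (size_t stride = 1; stride <= 1024; stride *= 2) {
        clock_t start = clock();
        char dummy = 0;
        for (int t = 0; t < 8; t++)                 /* many times */
            for (size_t i = 0; i < n; i += stride)
                dummy += A[i];                      /* touch an item in the array */
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        size_t touches = 8 * (n / stride + (n % stride ? 1 : 0));
        printf("stride %4zu: %.3f ns per access\n", stride, 1e9 * secs / touches);
        (void)dummy;
    }
    free((void *)A);
    return 0;
}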
Look at slides 35-42 of http://www.it.uu.se/edu/course/homepage/avdark/ht11/slides/11_Memory_and_optimization-1.pdf
Erik Hagersten is a really good teacher (and also really competent; he was lead architect at Sun at one point), so take a look at the rest of his slides for more great explanations!