So I am trying to measure the latencies of L1, L2, L3 cache using C. I know the size of them and I feel I understand conceptually how to do it but I am running into problems
Ok, several issues with your code:
As you mentioned, your measurement are taking a long time. In fact, they're very likely to take way longer than the single access itself, so they're not measuring anything useful. To mitigate that, access multiple elements, and amortize (divide the overall time by the number of accesses. Note that to measure latency, you want these accesses to be serialized, otherwise they can be performed in parallel and you'll only measure the throughput of unrelated accesses. To achieve that you could just add a false dependency between accesses.
For e.g., initialize the array to zeros, and do:
clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
for (int i = 0; i < NUM_ACCESSES; ++i) {
int tmp = arrayAccess[index]; //Access Value from Main Memory
index = (index + i + tmp) & 1023;
}
clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
.. and of course remember to divide the time by NUM_ACCESSES
.
Now, i've made the index intentionally complicated so that you avoid a fixed stride which might trigger a prefetcher (a bit of an overkill, you're not likely to notice an impact, but for the sake of demonstration...). You could probably settle for a simple index += 32
, which would give you strides of 128k (two cache lines), and avoid the "benefit" of most simple adjacent line/ simple stream prefetchers. I've also replaced the % 1000
with & 1023
since &
is faster, but it needs to be power of 2 to work the same way - so just increase ACCESS_SIZE
to 1024 and it should work.
Invalidating the L1 by loading something else is good, but the sizes look funny. You didn't specify your system but 256000
seems pretty big for an L1. An L2 is usually 256k on many common modern x86 CPUs for e.g. Also note that 256k is not 256000
, but rather 256*1024=262144
. Same goes for the second size: 1M is not 1024000
, it's 1024*1024=1048576
. Assuming that's indeed your L2 size (more likely an L3, but probably too small for that).
Your invalidating arrays are of type int
, so each element is longer than a single byte (most likely 4 byte, depending on system). You're actually invalidating L1_CACHE_SIZE*sizeof(int)
worth of bytes (and the same goes for the L2 invalidation loop)
memset
receives the size in bytes, your sizes are divided by sizeof(int)
Your invalidation reads are never used, and may be optimized out. Try to accumulate the reads in some value and print it in the end, to avoid this possibility.
The memset at the beginning is accessing the data as well, therefor your first loop is accessing data from the L3 (since the other 2 memsets were still effective in evicting it from L1+L2, although only partially due to the size error.
The strides may be too small so you get two access to the same cacheline (L1 hit). Make sure they're spread enough by adding 32 elements (x4 bytes) - that's 2 cacheline, so you also won't get any adjacent cacheline prefetch benefits.
Since NUM_ACCESSES is larger than ACCESS_SIZE, you're essentially repeating the same elements and would probably get L1 hits for them (so the avg time shifts in favor of L1 access latency). Instead try using the L1 size so you access the entire L1 (except for the skips) exactly once. For e.g. like this -
index = 0;
while (index < L1_CACHE_SIZE) {
int tmp = arrayAccess[index]; //Access Value from L2
index = (index + tmp + ((index & 4) ? 28 : 36)); // on average this should give 32 element skips, with changing strides
count++; //divide overall time by this
}
don't forget to increase arrayAccess
to L1 size.
Now, with the changes above (more or less), I get something like this:
L1 Cache Access 7.812500
L2 Cache Acces 15.625000
L3 Cache Access 23.437500
Which still seems a bit long, but possibly because it includes an additional dependency on arithmetic operations
I would rather try to use the hardware clock as a measure. The rdtsc
instruction will tell you the current cycle count since the CPU was powered up. Also it is better to use asm
to make sure always the same instructions are used in both measured and dry runs. Using that and some clever statistics I have made this a long time ago:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
int i386_cpuid_caches (size_t * data_caches) {
int i;
int num_data_caches = 0;
for (i = 0; i < 32; i++) {
// Variables to hold the contents of the 4 i386 legacy registers
uint32_t eax, ebx, ecx, edx;
eax = 4; // get cache info
ecx = i; // cache id
asm (
"cpuid" // call i386 cpuid instruction
: "+a" (eax) // contains the cpuid command code, 4 for cache query
, "=b" (ebx)
, "+c" (ecx) // contains the cache id
, "=d" (edx)
); // generates output in 4 registers eax, ebx, ecx and edx
// taken from http://download.intel.com/products/processor/manual/325462.pdf Vol. 2A 3-149
int cache_type = eax & 0x1F;
if (cache_type == 0) // end of valid cache identifiers
break;
char * cache_type_string;
switch (cache_type) {
case 1: cache_type_string = "Data Cache"; break;
case 2: cache_type_string = "Instruction Cache"; break;
case 3: cache_type_string = "Unified Cache"; break;
default: cache_type_string = "Unknown Type Cache"; break;
}
int cache_level = (eax >>= 5) & 0x7;
int cache_is_self_initializing = (eax >>= 3) & 0x1; // does not need SW initialization
int cache_is_fully_associative = (eax >>= 1) & 0x1;
// taken from http://download.intel.com/products/processor/manual/325462.pdf 3-166 Vol. 2A
// ebx contains 3 integers of 10, 10 and 12 bits respectively
unsigned int cache_sets = ecx + 1;
unsigned int cache_coherency_line_size = (ebx & 0xFFF) + 1;
unsigned int cache_physical_line_partitions = ((ebx >>= 12) & 0x3FF) + 1;
unsigned int cache_ways_of_associativity = ((ebx >>= 10) & 0x3FF) + 1;
// Total cache size is the product
size_t cache_total_size = cache_ways_of_associativity * cache_physical_line_partitions * cache_coherency_line_size * cache_sets;
if (cache_type == 1 || cache_type == 3) {
data_caches[num_data_caches++] = cache_total_size;
}
printf(
"Cache ID %d:\n"
"- Level: %d\n"
"- Type: %s\n"
"- Sets: %d\n"
"- System Coherency Line Size: %d bytes\n"
"- Physical Line partitions: %d\n"
"- Ways of associativity: %d\n"
"- Total Size: %zu bytes (%zu kb)\n"
"- Is fully associative: %s\n"
"- Is Self Initializing: %s\n"
"\n"
, i
, cache_level
, cache_type_string
, cache_sets
, cache_coherency_line_size
, cache_physical_line_partitions
, cache_ways_of_associativity
, cache_total_size, cache_total_size >> 10
, cache_is_fully_associative ? "true" : "false"
, cache_is_self_initializing ? "true" : "false"
);
}
return num_data_caches;
}
int test_cache(size_t attempts, size_t lower_cache_size, int * latencies, size_t max_latency) {
int fd = open("/dev/urandom", O_RDONLY);
if (fd < 0) {
perror("open");
abort();
}
char * random_data = mmap(
NULL
, lower_cache_size
, PROT_READ | PROT_WRITE
, MAP_PRIVATE | MAP_ANON // | MAP_POPULATE
, -1
, 0
); // get some random data
if (random_data == MAP_FAILED) {
perror("mmap");
abort();
}
size_t i;
for (i = 0; i < lower_cache_size; i += sysconf(_SC_PAGESIZE)) {
random_data[i] = 1;
}
int64_t random_offset = 0;
while (attempts--) {
// use processor clock timer for exact measurement
random_offset += rand();
random_offset %= lower_cache_size;
int32_t cycles_used, edx, temp1, temp2;
asm (
"mfence\n\t" // memory fence
"rdtsc\n\t" // get cpu cycle count
"mov %%edx, %2\n\t"
"mov %%eax, %3\n\t"
"mfence\n\t" // memory fence
"mov %4, %%al\n\t" // load data
"mfence\n\t"
"rdtsc\n\t"
"sub %2, %%edx\n\t" // substract cycle count
"sbb %3, %%eax" // substract cycle count
: "=a" (cycles_used)
, "=d" (edx)
, "=r" (temp1)
, "=r" (temp2)
: "m" (random_data[random_offset])
);
// printf("%d\n", cycles_used);
if (cycles_used < max_latency)
latencies[cycles_used]++;
else
latencies[max_latency - 1]++;
}
munmap(random_data, lower_cache_size);
return 0;
}
int main() {
size_t cache_sizes[32];
int num_data_caches = i386_cpuid_caches(cache_sizes);
int latencies[0x400];
memset(latencies, 0, sizeof(latencies));
int empty_cycles = 0;
int i;
int attempts = 1000000;
for (i = 0; i < attempts; i++) { // measure how much overhead we have for counting cyscles
int32_t cycles_used, edx, temp1, temp2;
asm (
"mfence\n\t" // memory fence
"rdtsc\n\t" // get cpu cycle count
"mov %%edx, %2\n\t"
"mov %%eax, %3\n\t"
"mfence\n\t" // memory fence
"mfence\n\t"
"rdtsc\n\t"
"sub %2, %%edx\n\t" // substract cycle count
"sbb %3, %%eax" // substract cycle count
: "=a" (cycles_used)
, "=d" (edx)
, "=r" (temp1)
, "=r" (temp2)
:
);
if (cycles_used < sizeof(latencies) / sizeof(*latencies))
latencies[cycles_used]++;
else
latencies[sizeof(latencies) / sizeof(*latencies) - 1]++;
}
{
int j;
size_t sum = 0;
for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
sum += latencies[j];
}
size_t sum2 = 0;
for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
sum2 += latencies[j];
if (sum2 >= sum * .75) {
empty_cycles = j;
fprintf(stderr, "Empty counting takes %d cycles\n", empty_cycles);
break;
}
}
}
for (i = 0; i < num_data_caches; i++) {
test_cache(attempts, cache_sizes[i] * 4, latencies, sizeof(latencies) / sizeof(*latencies));
int j;
size_t sum = 0;
for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
sum += latencies[j];
}
size_t sum2 = 0;
for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
sum2 += latencies[j];
if (sum2 >= sum * .75) {
fprintf(stderr, "Cache ID %i has latency %d cycles\n", i, j - empty_cycles);
break;
}
}
}
return 0;
}
Output on my Core2Duo:
Cache ID 0:
- Level: 1
- Type: Data Cache
- Total Size: 32768 bytes (32 kb)
Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Total Size: 32768 bytes (32 kb)
Cache ID 2:
- Level: 2
- Type: Unified Cache
- Total Size: 262144 bytes (256 kb)
Cache ID 3:
- Level: 3
- Type: Unified Cache
- Total Size: 3145728 bytes (3072 kb)
Empty counting takes 90 cycles
Cache ID 0 has latency 6 cycles
Cache ID 2 has latency 21 cycles
Cache ID 3 has latency 168 cycles
Widely used classic test for cache latency is iterating over the linked list. It works on modern superscalar/superpipelined CPU and even on Out-of-order cores like ARM Cortex-A9+ and Intel Core 2/ix. This method is used by open-source lmbench - in the test lat_mem_rd
(man page) and in CPU-Z latency measurement tool: http://cpuid.com/medias/files/softwares/misc/latency.zip (native Windows binary)
There are sources of lat_mem_rd test from lmbench: https://github.com/foss-for-synopsys-dwc-arc-processors/lmbench/blob/master/src/lat_mem_rd.c
And the main test is
#define ONE p = (char **)*p;
#define FIVE ONE ONE ONE ONE ONE
#define TEN FIVE FIVE
#define FIFTY TEN TEN TEN TEN TEN
#define HUNDRED FIFTY FIFTY
void
benchmark_loads(iter_t iterations, void *cookie)
{
struct mem_state* state = (struct mem_state*)cookie;
register char **p = (char**)state->p[0];
register size_t i;
register size_t count = state->len / (state->line * 100) + 1;
while (iterations-- > 0) {
for (i = 0; i < count; ++i) {
HUNDRED;
}
}
use_pointer((void *)p);
state->p[0] = (char*)p;
}
So, after deciphering the macro we do a lot of linear operations like:
p = (char**) *p; // (in intel syntax) == mov eax, [eax]
p = (char**) *p;
p = (char**) *p;
.... // 100 times total
p = (char**) *p;
over the memory, filled with pointers, every pointing stride
elements forward.
As says the man page http://www.bitmover.com/lmbench/lat_mem_rd.8.html
The benchmark runs as two nested loops. The outer loop is the stride size. The inner loop is the array size. For each array size, the benchmark creates a ring of pointers that point forward one stride. Traversing the array is done by
p = (char **)*p;
in a for loop (the over head of the for loop is not significant; the loop is an unrolled loop 1000 loads long). The loop stops after doing a million loads. The size of the array varies from 512 bytes to (typically) eight megabytes. For the small sizes, the cache will have an effect, and the loads will be much faster. This becomes much more apparent when the data is plotted.
More detailed description with examples on POWERs is available from IBM's wiki: Untangling memory access measurements - lat_mem_rd - by Jenifer Hopper 2013
The lat_mem_rd test (http://www.bitmover.com/lmbench/lat_mem_rd.8.html) takes two arguments, an array size in MB and a stride size. The benchmark uses two loops to traverse through the array, using the stride as the increment by creating a ring of pointers that point forward one stride. The test measures memory read latency in nanoseconds for the range of memory sizes. The output consists of two columns: the first is the array size in MB (the floating point value) and the second is the load latency over all the points of the array. When the results are graphed, you can clearly see the relative latencies of the entire memory hierarchy, including the faster latency of each cache level, and the main memory latency.
PS: There is paper from Intel (thanks to Eldar Abusalimov) with examples of running lat_mem_rd: ftp://download.intel.com/design/intarch/PAPERS/321074.pdf - sorry right url is http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-cache-latency-bandwidth-paper.pdf "Measuring Cache and Memory Latency and CPU to Memory Bandwidth - For use with Intel Architecture" by Joshua Ruggiero from December 2008:
Well for those interested, I couldn't get my first code set to work so I tried a couple alternative approaches that produced decent results.
The first used linked lists with nodes allocated stride bytes apart in a contiguous memory space. The dereferencing of the nodes mitigates the effectiveness of the pre-fetcher and in the case that multiple cache lines are pulled in I made the strides significantly large to avoid cache hits. As the size of the list allocated increases, it jumps to the cache or memory structure that will hold it showing clear divisions in latency.
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>
//MACROS
#define ONE iterate = (char**) *iterate;
#define FIVE ONE ONE ONE
#define TWOFIVE FIVE FIVE FIVE FIVE FIVE
#define HUNDO TWOFIVE TWOFIVE TWOFIVE TWOFIVE
//prototype
void allocateRandomArray(long double);
void accessArray(char *, long double, char**);
int main(){
//call the function for allocating arrays of increasing size in MB
allocateRandomArray(.00049);
allocateRandomArray(.00098);
allocateRandomArray(.00195);
allocateRandomArray(.00293);
allocateRandomArray(.00391);
allocateRandomArray(.00586);
allocateRandomArray(.00781);
allocateRandomArray(.01172);
allocateRandomArray(.01562);
allocateRandomArray(.02344);
allocateRandomArray(.03125);
allocateRandomArray(.04688);
allocateRandomArray(.0625);
allocateRandomArray(.09375);
allocateRandomArray(.125);
allocateRandomArray(.1875);
allocateRandomArray(.25);
allocateRandomArray(.375);
allocateRandomArray(.5);
allocateRandomArray(.75);
allocateRandomArray(1);
allocateRandomArray(1.5);
allocateRandomArray(2);
allocateRandomArray(3);
allocateRandomArray(4);
allocateRandomArray(6);
allocateRandomArray(8);
allocateRandomArray(12);
allocateRandomArray(16);
allocateRandomArray(24);
allocateRandomArray(32);
allocateRandomArray(48);
allocateRandomArray(64);
allocateRandomArray(96);
allocateRandomArray(128);
allocateRandomArray(192);
}
void allocateRandomArray(long double size){
int accessSize=(1024*1024*size); //array size in bytes
char * randomArray = malloc(accessSize*sizeof(char)); //allocate array of size allocate size
int counter;
int strideSize=4096; //step size
char ** head = (char **) randomArray; //start of linked list in contiguous memory
char ** iterate = head; //iterator for linked list
for(counter=0; counter < accessSize; counter+=strideSize){
(*iterate) = &randomArray[counter+strideSize]; //iterate through linked list, having each one point stride bytes forward
iterate+=(strideSize/sizeof(iterate)); //increment iterator stride bytes forward
}
*iterate = (char *) head; //set tailf to point to head
accessArray(randomArray, size, head);
free(randomArray);
}
void accessArray(char *cacheArray, long double size, char** head){
const long double NUM_ACCESSES = 1000000000/100; //number of accesses to linked list
const int SECONDS_PER_NS = 1000000000; //const for timer
FILE *fp = fopen("accessData.txt", "a"); //open file for writing data
int newIndex=0;
int counter=0;
int read=0;
struct timespec startAccess, endAccess; //struct for timer
long double accessTime = 0;
char ** iterate = head; //create iterator
clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
for(counter=0; counter < NUM_ACCESSES; counter++){
HUNDO //macro subsitute 100 accesses to mitigate loop overhead
}
clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
//calculate the time elapsed in ns per access
accessTime = (((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec)) / (100*NUM_ACCESSES);
fprintf(fp, "%Lf\t%Lf\n", accessTime, size); //print results to file
fclose(fp); //close file
}
This produced the most consistent results, and using a variety of array sizes and plotting the respective latencies gave a very clear distinction of the different cache sizes present.
The next method like the previous allocated increasing size arrays. But instead of using a linked list for memory access, I fill each index with its respective number and randomly shuffled the array. I then used these indexes to hop around randomly within the array for accesses, mitigating the effects of the pre-fetcher. However, it had an occasional strong deviation in access time when multiple adjacent cache lines are pulled in and happen to be hit.
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>
//prototype
void allocateRandomArray(long double);
void accessArray(int *, long int);
int main(){
srand(time(NULL)); // Seed random function
int i=0;
for(i=2; i < 32; i++){
allocateRandomArray(pow(2, i)); //call latency function on arrays of increasing size
}
}
void allocateRandomArray(long double size){
int accessSize = (size) / sizeof(int);
int * randomArray = malloc(accessSize*sizeof(int));
int counter;
for(counter=0; counter < accessSize; counter ++){
randomArray[counter] = counter;
}
for(counter=0; counter < accessSize; counter ++){
int i,j;
int swap;
i = rand() % accessSize;
j = rand() % accessSize;
swap = randomArray[i];
randomArray[i] = randomArray[j];
randomArray[j] = swap;
}
accessArray(randomArray, accessSize);
free(randomArray);
}
void accessArray(int *cacheArray, long int size){
const long double NUM_ACCESSES = 1000000000;
const int SECONDS_PER_NS = 1000000000;
int newIndex=0;
int counter=0;
int read=0;
struct timespec startAccess, endAccess;
long double accessTime = 0;
clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
for(counter = 0; counter < NUM_ACCESSES; counter++){
newIndex=cacheArray[newIndex];
}
clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
//calculate the time elapsed in ns per access
accessTime = (((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec)) / (NUM_ACCESSES);
printf("Access time: %Lf for size %ld\n", accessTime, size);
}
Averaged across many trials, this method produced relatively accurate results as well. The first choice is definitely the better of the two but this is an alternate approach that works fine as well.
Not really an answer but read anyway some thing has already been mentioned in other answers and comments here
well just the other day I answer this question:
it is about measurement of L1/L2/.../L?/MEMORY
transfer rates take a look at it for better start point of your problem
[Notes]
I strongly recommend to use RDTSC instruction for time measurement
especially for L1 as anything else is too slow. Do not forget to set process affinity to single CPU because all cores have their own counter and their count differs a lot even on the same input Clock !!!
Adjust the CPU clock to Maximum for variable clock computers and do not forget to account for RDTSC
overflow if you use just 32bit part (modern CPU overflow 32bit counter in a second). For time computation use CPU clock (measure it or use registry value)
t0 <- RDTSC
Sleep(250);
t1 <- RDTSC
CPU f=(t1-t0)<<2 [Hz]
set process affinity to single CPU
all CPU cores have usually their own L1,L2 caches so on multi-task OS you can measure confusing things if you do not do this
do graphical output (diagram)
then you will see what actually happens in that link above I posted quite a few plots
use highest process priority available by OS