Measuring Cache Latencies

感情败类 2020-11-28 20:16

So I am trying to measure the latencies of the L1, L2, and L3 caches using C. I know their sizes and I feel I understand conceptually how to do it, but I am running into problems.

5 Answers
  • 2020-11-28 20:40

    Ok, several issues with your code:

    1. As you mentioned, your measurements are taking a long time. In fact, they very likely take far longer than the single access itself, so they're not measuring anything useful. To mitigate that, access multiple elements and amortize (divide the overall time by the number of accesses). Note that to measure latency you want these accesses to be serialized; otherwise they can be performed in parallel and you'll only measure the throughput of unrelated accesses. To achieve that, you can add a false dependency between accesses.

      For example, initialize the array to zeros, and do:

      clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
      for (int i = 0; i < NUM_ACCESSES; ++i) {
          int tmp = arrayAccess[index];                             //Access Value from Main Memory
          index = (index + i + tmp) & 1023;   
      }
      clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
      

      ... and of course remember to divide the time by NUM_ACCESSES.
      Now, I've made the index intentionally complicated so that you avoid a fixed stride, which might trigger a prefetcher (a bit of an overkill; you're not likely to notice an impact, but for the sake of demonstration...). You could probably settle for a simple index += 32, which would give you strides of 128 bytes (two cache lines, with 4-byte ints and 64-byte lines) and avoid the "benefit" of most simple adjacent-line and stream prefetchers. I've also replaced the % 1000 with & 1023 since & is faster, but the size needs to be a power of 2 for it to work the same way, so just increase ACCESS_SIZE to 1024 and it should work.

    2. Invalidating the L1 by loading something else is good, but the sizes look funny. You didn't specify your system, but 256000 seems pretty big for an L1; an L2 is usually 256k on many common modern x86 CPUs, for example. Also note that 256k is not 256000 but 256*1024=262144. The same goes for the second size: 1M is not 1024000, it's 1024*1024=1048576 (assuming that's indeed your L2 size; it looks more like an L3, but is probably too small for that).

    3. Your invalidating arrays are of type int, so each element is longer than a single byte (most likely 4 bytes, depending on the system). You're actually invalidating L1_CACHE_SIZE*sizeof(int) bytes (and the same goes for the L2 invalidation loop).
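
    The size arithmetic in points 2 and 3 can be checked with a short sketch. The names KIB, elems_for_bytes and evict_by_reading are made up for illustration and are not from the question's code. The read loop returns its sum so that the caller has a value to use.

```c
#include <stddef.h>

/* Hypothetical helper: n KiB expressed in bytes (1 KiB = 1024 bytes). */
#define KIB(n) ((size_t)(n) * 1024u)

/* How many elements of elem_size bytes are needed to cover `bytes` bytes. */
size_t elems_for_bytes(size_t bytes, size_t elem_size)
{
    return bytes / elem_size;
}

/* Touch every element once; returning the sum keeps the reads "used". */
long evict_by_reading(const int *buf, size_t n_elems)
{
    long sum = 0;
    for (size_t i = 0; i < n_elems; ++i)
        sum += buf[i];
    return sum;
}
```

    For a 256 KiB cache and an int array, elems_for_bytes(KIB(256), sizeof(int)) elements cover the cache once; using 262144 elements instead would touch sizeof(int) times as many bytes.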

    Update:

    1. memset receives the size in bytes, but your sizes are divided by sizeof(int).

    2. Your invalidation reads are never used, and may be optimized out. Try to accumulate the reads in some value and print it in the end, to avoid this possibility.

    3. The memset at the beginning accesses the data as well, therefore your first loop is accessing data from the L3 (since the other 2 memsets were still effective in evicting it from L1+L2, although only partially due to the size error).

    4. The strides may be too small, so you get two accesses to the same cache line (an L1 hit). Make sure they're spread far enough apart by adding 32 elements (x4 bytes); that's 2 cache lines, so you also won't get any adjacent-cache-line prefetch benefits.

    5. Since NUM_ACCESSES is larger than ACCESS_SIZE, you're essentially repeating the same elements and would probably get L1 hits for them (so the average time shifts in favor of the L1 access latency). Instead, try using the L1 size so you access the entire L1 (except for the skips) exactly once. For example, like this:

      index = 0;
      while (index < L1_CACHE_SIZE) {
          int tmp = arrayAccess[index];               //Access Value from L2
          index = (index + tmp + ((index & 4) ? 28 : 36));   // on average this should give 32 element skips, with changing strides
          count++;                                           //divide overall time by this 
      }
      

    Don't forget to increase arrayAccess to the L1 size.

    Now, with the changes above (more or less), I get something like this:

    L1 Cache Access 7.812500
    L2 Cache Acces 15.625000
    L3 Cache Access 23.437500
    

    This still seems a bit long, possibly because it includes an additional dependency on arithmetic operations.
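
    Putting point 1 (serialized accesses, amortized timing) and point 5 (one pass with varying skips) together could look roughly like the sketch below. It is not the code that produced the numbers above. The function names are hypothetical, and CLOCK_MONOTONIC is my assumption in place of the CLOCK_REALTIME used in the snippets above. The array must be zero-filled so the loaded value changes nothing except the dependency.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Walk arr with the alternating 36/28-element skips from point 5.
 * Each next index depends on the value just loaded, so loads cannot overlap.
 * arr must be zero-filled; the last loaded value goes to *sink so it is used. */
size_t walk_dependent(const int *arr, size_t n_elems, int *sink)
{
    size_t index = 0, count = 0;
    int tmp = 0;
    while (index < n_elems) {
        tmp = arr[index];
        index = index + (size_t)tmp + ((index & 4) ? 28 : 36);
        count++;
    }
    *sink = tmp;
    return count;
}

/* Average nanoseconds per access for one walk.
 * The figure includes the index arithmetic and the clock calls. */
double ns_per_access(const int *arr, size_t n_elems, int *sink)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t count = walk_dependent(arr, n_elems, sink);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return count ? ns / (double)count : 0.0;
}
```

    With a 32 KiB array of 4-byte ints (8192 elements) a single walk performs only 256 loads, so expect noisy numbers from one pass; the eviction step before each walk is left out here.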

  • 2020-11-28 20:44

    I would rather use the hardware clock as the measure. The rdtsc instruction returns the time-stamp counter, i.e. the cycle count since the CPU was powered up. It is also better to use asm to make sure that exactly the same instructions are used in both the measured and the dry runs. Using that and some clever statistics, I wrote this a long time ago:

    #include <stdlib.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>
    #include <sys/mman.h>
    
    
    int i386_cpuid_caches (size_t * data_caches) {
        int i;
        int num_data_caches = 0;
        for (i = 0; i < 32; i++) {
    
            // Variables to hold the contents of the 4 i386 legacy registers
            uint32_t eax, ebx, ecx, edx; 
    
            eax = 4; // get cache info
            ecx = i; // cache id
    
            asm (
                "cpuid" // call i386 cpuid instruction
                : "+a" (eax) // contains the cpuid command code, 4 for cache query
                , "=b" (ebx)
                , "+c" (ecx) // contains the cache id
                , "=d" (edx)
            ); // generates output in 4 registers eax, ebx, ecx and edx 
    
            // taken from http://download.intel.com/products/processor/manual/325462.pdf Vol. 2A 3-149
            int cache_type = eax & 0x1F; 
    
            if (cache_type == 0) // end of valid cache identifiers
                break;
    
            char * cache_type_string;
            switch (cache_type) {
                case 1: cache_type_string = "Data Cache"; break;
                case 2: cache_type_string = "Instruction Cache"; break;
                case 3: cache_type_string = "Unified Cache"; break;
                default: cache_type_string = "Unknown Type Cache"; break;
            }
    
            int cache_level = (eax >>= 5) & 0x7;
    
            int cache_is_self_initializing = (eax >>= 3) & 0x1; // does not need SW initialization
            int cache_is_fully_associative = (eax >>= 1) & 0x1;
    
    
            // taken from http://download.intel.com/products/processor/manual/325462.pdf 3-166 Vol. 2A
            // ebx contains 3 integers of 10, 10 and 12 bits respectively
            unsigned int cache_sets = ecx + 1;
            unsigned int cache_coherency_line_size = (ebx & 0xFFF) + 1;
            unsigned int cache_physical_line_partitions = ((ebx >>= 12) & 0x3FF) + 1;
            unsigned int cache_ways_of_associativity = ((ebx >>= 10) & 0x3FF) + 1;
    
            // Total cache size is the product
            size_t cache_total_size = cache_ways_of_associativity * cache_physical_line_partitions * cache_coherency_line_size * cache_sets;
    
            if (cache_type == 1 || cache_type == 3) {
                data_caches[num_data_caches++] = cache_total_size;
            }
    
            printf(
                "Cache ID %d:\n"
                "- Level: %d\n"
                "- Type: %s\n"
                "- Sets: %d\n"
                "- System Coherency Line Size: %d bytes\n"
                "- Physical Line partitions: %d\n"
                "- Ways of associativity: %d\n"
                "- Total Size: %zu bytes (%zu kb)\n"
                "- Is fully associative: %s\n"
                "- Is Self Initializing: %s\n"
                "\n"
                , i
                , cache_level
                , cache_type_string
                , cache_sets
                , cache_coherency_line_size
                , cache_physical_line_partitions
                , cache_ways_of_associativity
                , cache_total_size, cache_total_size >> 10
                , cache_is_fully_associative ? "true" : "false"
                , cache_is_self_initializing ? "true" : "false"
            );
        }
    
        return num_data_caches;
    }
    
    int test_cache(size_t attempts, size_t lower_cache_size, int * latencies, size_t max_latency) {
        int fd = open("/dev/urandom", O_RDONLY);
        if (fd < 0) {
            perror("open");
            abort();
        }
        char * random_data = mmap(
              NULL
            , lower_cache_size
            , PROT_READ | PROT_WRITE
            , MAP_PRIVATE | MAP_ANON // | MAP_POPULATE
            , -1
            , 0
            ); // get some random data
        if (random_data == MAP_FAILED) {
            perror("mmap");
            abort();
        }
    
        size_t i;
        for (i = 0; i < lower_cache_size; i += sysconf(_SC_PAGESIZE)) {
            random_data[i] = 1;
        }
    
    
        int64_t random_offset = 0;
        while (attempts--) {
            // use processor clock timer for exact measurement
            random_offset += rand();
            random_offset %= lower_cache_size;
            int32_t cycles_used, edx, temp1, temp2;
            asm volatile (          // volatile: must not be moved or merged by the compiler
                "mfence\n\t"        // memory fence
                "rdtsc\n\t"         // get cpu cycle count
                "mov %%edx, %2\n\t" // save high 32 bits
                "mov %%eax, %3\n\t" // save low 32 bits
                "mfence\n\t"        // memory fence
                "mov %4, %%al\n\t"  // load data
                "mfence\n\t"
                "rdtsc\n\t"
                "sub %3, %%eax\n\t" // subtract low 32 bits of the cycle count
                "sbb %2, %%edx"     // subtract high 32 bits, with borrow
                : "=&a" (cycles_used)   // early-clobber: written before the input is read
                , "=&d" (edx)
                , "=&r" (temp1)
                , "=&r" (temp2)
                : "m" (random_data[random_offset])
                );
            // printf("%d\n", cycles_used);
            if (cycles_used < max_latency)
                latencies[cycles_used]++;
            else 
                latencies[max_latency - 1]++;
        }
    
        munmap(random_data, lower_cache_size);
    
        return 0;
    } 
    
    int main() {
        size_t cache_sizes[32];
        int num_data_caches = i386_cpuid_caches(cache_sizes);
    
        int latencies[0x400];
        memset(latencies, 0, sizeof(latencies));
    
        int empty_cycles = 0;
    
        int i;
        int attempts = 1000000;
        for (i = 0; i < attempts; i++) { // measure how much overhead we have for counting cycles
            int32_t cycles_used, edx, temp1, temp2;
            asm volatile (          // volatile: must not be hoisted out of the loop
                "mfence\n\t"        // memory fence
                "rdtsc\n\t"         // get cpu cycle count
                "mov %%edx, %2\n\t" // save high 32 bits
                "mov %%eax, %3\n\t" // save low 32 bits
                "mfence\n\t"        // memory fence
                "mfence\n\t"
                "rdtsc\n\t"
                "sub %3, %%eax\n\t" // subtract low 32 bits of the cycle count
                "sbb %2, %%edx"     // subtract high 32 bits, with borrow
                : "=&a" (cycles_used)
                , "=&d" (edx)
                , "=&r" (temp1)
                , "=&r" (temp2)
                :
                );
            if (cycles_used < sizeof(latencies) / sizeof(*latencies))
                latencies[cycles_used]++;
            else 
                latencies[sizeof(latencies) / sizeof(*latencies) - 1]++;
    
        }
    
        {
            int j;
            size_t sum = 0;
            for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
                sum += latencies[j];
            }
            size_t sum2 = 0;
            for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
                sum2 += latencies[j];
                if (sum2 >= sum * .75) {
                    empty_cycles = j;
                    fprintf(stderr, "Empty counting takes %d cycles\n", empty_cycles);
                    break;
                }
            }
        }
    
        for (i = 0; i < num_data_caches; i++) {
            memset(latencies, 0, sizeof(latencies)); // start each cache level from an empty histogram
            test_cache(attempts, cache_sizes[i] * 4, latencies, sizeof(latencies) / sizeof(*latencies));
    
            int j;
            size_t sum = 0;
            for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
                sum += latencies[j];
            }
            size_t sum2 = 0;
            for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
                sum2 += latencies[j];
                if (sum2 >= sum * .75) {
                    fprintf(stderr, "Cache ID %i has latency %d cycles\n", i, j - empty_cycles);
                    break;
                }
            }
    
        }
    
        return 0;
    
    }
    

    Output on my Core2Duo:

    Cache ID 0:
    - Level: 1
    - Type: Data Cache
    - Total Size: 32768 bytes (32 kb)
    
    Cache ID 1:
    - Level: 1
    - Type: Instruction Cache
    - Total Size: 32768 bytes (32 kb)
    
    Cache ID 2:
    - Level: 2
    - Type: Unified Cache
    - Total Size: 262144 bytes (256 kb)
    
    Cache ID 3:
    - Level: 3
    - Type: Unified Cache
    - Total Size: 3145728 bytes (3072 kb)
    
    Empty counting takes 90 cycles
    Cache ID 0 has latency 6 cycles
    Cache ID 2 has latency 21 cycles
    Cache ID 3 has latency 168 cycles
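
    Two pieces of the program above can be tried in isolation, without inline assembly: the cache size computed from the CPUID leaf 4 register fields, and the "clever statistics", which is the bucket where the running histogram total reaches 75% of the samples. The function names below are hypothetical, and this is only a sketch of those two steps.

```c
#include <stddef.h>
#include <stdint.h>

/* Cache size in bytes from the EBX and ECX values of CPUID leaf 4:
 * EBX[11:0] = line size - 1, EBX[21:12] = partitions - 1,
 * EBX[31:22] = ways - 1, ECX = sets - 1. */
size_t cache_size_from_leaf4(uint32_t ebx, uint32_t ecx)
{
    size_t line_size  = (ebx & 0xFFF) + 1;
    size_t partitions = ((ebx >> 12) & 0x3FF) + 1;
    size_t ways       = ((ebx >> 22) & 0x3FF) + 1;
    size_t sets       = (size_t)ecx + 1;
    return ways * partitions * line_size * sets;
}

/* Smallest bucket index at which the running total reaches `fraction`
 * of all samples. Returns n - 1 if the histogram is empty or the
 * fraction is never reached. n must be at least 1. */
size_t histogram_percentile(const int *hist, size_t n, double fraction)
{
    size_t total = 0, running = 0;
    for (size_t j = 0; j < n; ++j)
        total += (size_t)hist[j];
    for (size_t j = 0; j < n; ++j) {
        running += (size_t)hist[j];
        if (total > 0 && (double)running >= (double)total * fraction)
            return j;
    }
    return n - 1;
}
```

    For example, EBX = 0x01C0003F with ECX = 63 decodes to 8 ways, 1 partition, 64-byte lines and 64 sets, i.e. 32768 bytes, which matches the 32 kb L1 data cache in the output above.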
    
  • 2020-11-28 20:48

    A widely used classic test for cache latency is iterating over a linked list. It works on modern superscalar/superpipelined CPUs and even on out-of-order cores like ARM Cortex-A9+ and Intel Core 2/ix. This method is used by the open-source lmbench, in the test lat_mem_rd (man page), and in the CPU-Z latency measurement tool: http://cpuid.com/medias/files/softwares/misc/latency.zip (native Windows binary).

    The source of the lat_mem_rd test from lmbench is here: https://github.com/foss-for-synopsys-dwc-arc-processors/lmbench/blob/master/src/lat_mem_rd.c

    And the main test is

    #define ONE p = (char **)*p;
    #define FIVE    ONE ONE ONE ONE ONE
    #define TEN FIVE FIVE
    #define FIFTY   TEN TEN TEN TEN TEN
    #define HUNDRED FIFTY FIFTY
    
    void
    benchmark_loads(iter_t iterations, void *cookie)
    {
        struct mem_state* state = (struct mem_state*)cookie;
        register char **p = (char**)state->p[0];
        register size_t i;
        register size_t count = state->len / (state->line * 100) + 1;
    
        while (iterations-- > 0) {
            for (i = 0; i < count; ++i) {
                HUNDRED;
            }
        }
    
        use_pointer((void *)p);
        state->p[0] = (char*)p;
    }
    

    So, after deciphering the macro we do a lot of linear operations like:

     p = (char**) *p;  // (in intel syntax) == mov eax, [eax]
     p = (char**) *p;
     p = (char**) *p;
     ....   // 100 times total
     p = (char**) *p;
    

    over memory filled with pointers, each pointing stride elements forward.

    As the man page http://www.bitmover.com/lmbench/lat_mem_rd.8.html says:

    The benchmark runs as two nested loops. The outer loop is the stride size. The inner loop is the array size. For each array size, the benchmark creates a ring of pointers that point forward one stride. Traversing the array is done by

     p = (char **)*p;
    

    in a for loop (the over head of the for loop is not significant; the loop is an unrolled loop 1000 loads long). The loop stops after doing a million loads. The size of the array varies from 512 bytes to (typically) eight megabytes. For the small sizes, the cache will have an effect, and the loads will be much faster. This becomes much more apparent when the data is plotted.
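
    A minimal sketch of the "ring of pointers that point forward one stride" described above, plus the chase itself. This is not lmbench's actual initialization code, and the names build_ring and chase are made up. It assumes the buffer is suitably aligned for pointers, that stride is a multiple of sizeof(char *), and that len is at least stride.

```c
#include <stddef.h>

/* Place one pointer every `stride` bytes in buf; each points to the slot
 * one stride further on, and the last one points back to the start. */
char **build_ring(char *buf, size_t len, size_t stride)
{
    size_t n = len / stride;                      /* number of slots */
    for (size_t i = 0; i < n; ++i) {
        char **slot = (char **)(buf + i * stride);
        *slot = buf + ((i + 1) % n) * stride;     /* wraps at the end */
    }
    return (char **)buf;
}

/* Follow the chain for `steps` loads; each load depends on the previous one. */
char **chase(char **p, size_t steps)
{
    while (steps--)
        p = (char **)*p;
    return p;
}
```

    Timing chase() over a large step count and dividing by that count gives the per-load latency for the given array size. The returned pointer should be used (printed or stored) so the compiler keeps the loads.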

    A more detailed description, with examples on POWER systems, is available from IBM's wiki: Untangling memory access measurements - lat_mem_rd, by Jenifer Hopper, 2013.

    The lat_mem_rd test (http://www.bitmover.com/lmbench/lat_mem_rd.8.html) takes two arguments, an array size in MB and a stride size. The benchmark uses two loops to traverse through the array, using the stride as the increment by creating a ring of pointers that point forward one stride. The test measures memory read latency in nanoseconds for the range of memory sizes. The output consists of two columns: the first is the array size in MB (the floating point value) and the second is the load latency over all the points of the array. When the results are graphed, you can clearly see the relative latencies of the entire memory hierarchy, including the faster latency of each cache level, and the main memory latency.

    PS: There is a paper from Intel (thanks to Eldar Abusalimov) with examples of running lat_mem_rd: "Measuring Cache and Memory Latency and CPU to Memory Bandwidth - For use with Intel Architecture" by Joshua Ruggiero, December 2008. The current URL is http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-cache-latency-bandwidth-paper.pdf (it was previously at ftp://download.intel.com/design/intarch/PAPERS/321074.pdf).

  • 2020-11-28 20:55

    Well for those interested, I couldn't get my first code set to work so I tried a couple alternative approaches that produced decent results.

    The first used a linked list with nodes allocated stride bytes apart in a contiguous memory space. Dereferencing the nodes reduces the effectiveness of the prefetcher, and in case multiple cache lines are pulled in, I made the stride large enough to avoid cache hits. As the size of the allocated list increases, it moves to the next cache level or memory structure that can hold it, showing clear divisions in latency.

    #include <time.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <math.h>
    
    //MACROS
    #define ONE iterate = (char**) *iterate;
    #define FIVE ONE ONE ONE ONE ONE
    #define TWOFIVE FIVE FIVE FIVE FIVE FIVE
    #define HUNDO TWOFIVE TWOFIVE TWOFIVE TWOFIVE
    
    //prototype
    void allocateRandomArray(long double);
    void accessArray(char *, long double, char**);
    
    int main(){
        //call the function for allocating arrays of increasing size in MB
        allocateRandomArray(.00049);
        allocateRandomArray(.00098);
        allocateRandomArray(.00195);
        allocateRandomArray(.00293);
        allocateRandomArray(.00391);
        allocateRandomArray(.00586);
        allocateRandomArray(.00781);
        allocateRandomArray(.01172);
        allocateRandomArray(.01562);
        allocateRandomArray(.02344);
        allocateRandomArray(.03125);
        allocateRandomArray(.04688);
        allocateRandomArray(.0625);
        allocateRandomArray(.09375);
        allocateRandomArray(.125);
        allocateRandomArray(.1875);
        allocateRandomArray(.25);
        allocateRandomArray(.375);
        allocateRandomArray(.5);
        allocateRandomArray(.75);
        allocateRandomArray(1);
        allocateRandomArray(1.5);
        allocateRandomArray(2);
        allocateRandomArray(3);
        allocateRandomArray(4);
        allocateRandomArray(6);
        allocateRandomArray(8);
        allocateRandomArray(12);
        allocateRandomArray(16);
        allocateRandomArray(24);
        allocateRandomArray(32);
        allocateRandomArray(48);
        allocateRandomArray(64);
        allocateRandomArray(96);
        allocateRandomArray(128);
        allocateRandomArray(192);
    }
    
    void allocateRandomArray(long double size){
        int accessSize=(1024*1024*size); //array size in bytes
        char * randomArray = malloc(accessSize*sizeof(char));    //allocate array of size allocate size
        int counter;
        int strideSize=4096;        //step size
    
        char ** head = (char **) randomArray;   //start of linked list in contiguous memory
        char ** iterate = head;         //iterator for linked list
        for(counter=0; counter + strideSize + (int)sizeof(char *) <= accessSize; counter+=strideSize){
            (*iterate) = &randomArray[counter+strideSize];      //point this node stride bytes forward
            iterate+=(strideSize/sizeof(iterate));          //increment iterator stride bytes forward
        }
        *iterate = (char *) head;       //set tail to point to head; the loop bound keeps this write inside the array
    
        accessArray(randomArray, size, head);
        free(randomArray);
    }
    
    void accessArray(char *cacheArray, long double size, char** head){
        const long double NUM_ACCESSES = 1000000000/100;    //number of accesses to linked list
        const int SECONDS_PER_NS = 1000000000;      //const for timer
        FILE *fp =  fopen("accessData.txt", "a");   //open file for writing data
        int newIndex=0;
        int counter=0;
        int read=0;
        struct timespec startAccess, endAccess;     //struct for timer
        long double accessTime = 0;
        char ** iterate = head;     //create iterator
    
        clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
        for(counter=0; counter < NUM_ACCESSES; counter++){
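            HUNDO       //macro substitutes 100 accesses to mitigate loop overhead
        }
        clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
        char ** volatile sink = iterate;    //store the final pointer so the compiler cannot drop the loop
        (void) sink;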
        //calculate the time elapsed in ns per access
        accessTime = (((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec)) / (100*NUM_ACCESSES);
        fprintf(fp, "%Lf\t%Lf\n", accessTime, size);  //print results to file
        fclose(fp);  //close file
    }
    

    This produced the most consistent results, and using a variety of array sizes and plotting the respective latencies gave a very clear distinction of the different cache sizes present.
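
    The per-access time in the listing above comes from a timespec difference divided by the number of loads. A small standalone sketch of just that arithmetic follows. The function names are hypothetical, and the constant is named NS_PER_SECOND here, which is what the value 1000000000 actually represents.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Nanoseconds in one second. */
#define NS_PER_SECOND 1000000000L

/* Elapsed nanoseconds between two timestamps, computed in floating point. */
double elapsed_ns(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * (double)NS_PER_SECOND
         + (double)(end.tv_nsec - start.tv_nsec);
}

/* Average nanoseconds per access over `accesses` dependent loads. */
double ns_per_load(struct timespec start, struct timespec end, double accesses)
{
    return elapsed_ns(start, end) / accesses;
}
```

    The nanosecond field of the end time can be smaller than that of the start time; the signed subtraction handles that as long as the seconds are included, as they are here.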

    The next method, like the previous one, allocated arrays of increasing size. But instead of using a linked list for memory access, I filled each index with its respective number and randomly shuffled the array. I then used these values as indexes to hop around randomly within the array, mitigating the effects of the prefetcher. However, it had an occasional strong deviation in access time when multiple adjacent cache lines were pulled in and happened to be hit.

    #include <time.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <math.h>
    
    //prototype
    void allocateRandomArray(long double);
    void accessArray(int *, long int);
    
    int main(){
        srand(time(NULL));  // Seed random function
        int i=0;
        for(i=2; i < 32; i++){
            allocateRandomArray(pow(2, i));         //call latency function on arrays of increasing size
        }
    
    
    }
    
    void allocateRandomArray(long double size){
        int accessSize = (size) / sizeof(int);
        int * randomArray = malloc(accessSize*sizeof(int));
        int counter;
    
        for(counter=0; counter < accessSize; counter ++){
            randomArray[counter] = counter; 
        }
        for(counter=0; counter < accessSize; counter ++){
            int i,j;
            int swap;
            i = rand() % accessSize;
            j = rand() % accessSize;
            swap = randomArray[i];
            randomArray[i] = randomArray[j];
            randomArray[j] = swap;
        } 
    
        accessArray(randomArray, accessSize);
        free(randomArray);
    }
    
    void accessArray(int *cacheArray, long int size){
        const long double NUM_ACCESSES = 1000000000;
        const int SECONDS_PER_NS = 1000000000;
        int newIndex=0;
        int counter=0;
        int read=0;
        struct timespec startAccess, endAccess;
        long double accessTime = 0;
    
        clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
        for(counter = 0; counter < NUM_ACCESSES; counter++){
            newIndex=cacheArray[newIndex];
        }
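        clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
        volatile int sink = newIndex;   //store the final index so the compiler cannot drop the loop
        (void) sink;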
        //calculate the time elapsed in ns per access
        accessTime = (((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec)) / (NUM_ACCESSES);
        printf("Access time: %Lf for size %ld\n", accessTime, size);
    } 
    

    Averaged across many trials, this method produced relatively accurate results as well. The first choice is definitely the better of the two but this is an alternate approach that works fine as well.
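
    One thing worth knowing about the shuffled-index variant: random pairwise swaps give some permutation, but the chain that starts at index 0 may close into a short cycle and then keeps revisiting a handful of slots. A possible alternative, which is not what the listing above does, is Sattolo's algorithm, which always produces one cycle through all n slots. A sketch with hypothetical function names:

```c
#include <stddef.h>
#include <stdlib.h>

/* Fill next[0..n-1] with a random permutation that forms one cycle of
 * length n (Sattolo's algorithm), so a chase visits every slot. */
void build_single_cycle(int *next, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        next[i] = (int)i;
    for (size_t i = n; i-- > 1; ) {
        size_t j = (size_t)rand() % i;      /* 0 <= j < i, never i itself */
        int tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain from slot 0 and count steps until it returns to 0. */
size_t cycle_length(const int *next)
{
    size_t steps = 0;
    int idx = 0;
    do {
        idx = next[idx];
        steps++;
    } while (idx != 0);
    return steps;
}
```

    The only difference from a Fisher-Yates shuffle is that j is drawn from below i and never equals i. rand() % i is slightly biased and limited by RAND_MAX, which affects how uniform the permutation is but not the single-cycle property.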

  • 2020-11-28 20:58

    Not really an answer, but read it anyway; some of this has already been mentioned in other answers and comments here.

    Just the other day I answered this question:

    • Cache size estimation on your system?

    It is about measuring L1/L2/.../L?/MEMORY transfer rates; take a look at it for a better starting point for your problem.

    [Notes]

    1. I strongly recommend using the RDTSC instruction for time measurement

      especially for L1, as anything else is too slow. Do not forget to set the process affinity to a single CPU, because each core has its own counter and their counts differ a lot even with the same input clock!

      Set the CPU clock to maximum on variable-clock machines, and do not forget to account for RDTSC overflow if you use just the 32-bit part (a modern CPU overflows a 32-bit counter in about a second). For time computation use the CPU clock frequency (measure it as follows, or use the registry value):

      t0 <- RDTSC
      Sleep(250);
      t1 <- RDTSC
      CPU f=(t1-t0)<<2 [Hz]
      
    2. Set the process affinity to a single CPU

      CPU cores usually have their own L1 and L2 caches, so on a multitasking OS you can measure confusing things if you do not do this.

    3. Produce graphical output (a diagram)

      Then you will see what actually happens; in the linked answer above I posted quite a few plots.

    4. Use the highest process priority the OS makes available
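
    The frequency estimate and the 32-bit overflow remark from note 1 can be written out as plain arithmetic. This sketch takes already-read counter values as arguments, so it contains no RDTSC or Sleep call itself; the function names are hypothetical.

```c
#include <stdint.h>

/* Ticks elapsed between two reads of the low 32 bits of the counter.
 * Unsigned subtraction wraps modulo 2^32, so one overflow between the
 * two reads is handled; more than one cannot be detected. */
uint32_t tsc_delta32(uint32_t t0, uint32_t t1)
{
    return t1 - t0;
}

/* Estimated counter frequency in Hz from a delta measured over sleep_ms
 * milliseconds (sleep_ms must be nonzero). */
uint64_t tsc_hz(uint64_t delta_ticks, uint32_t sleep_ms)
{
    return delta_ticks * 1000u / sleep_ms;
}
```

    With a 250 ms sleep the factor is 1000/250 = 4, which is the << 2 in the pseudocode of note 1. At several GHz a 250 ms interval still fits in 32 bits, but using the full 64-bit counter avoids the question.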
