CPU cache critical stride test giving unexpected results based on access type

野趣味 2021-02-02 15:40

Inspired by this recent question on SO and the answers given, which made me feel very ignorant, I decided I'd spend some time to learn more about CPU caching a…

3 Answers
  • 2021-02-02 16:12

    I also tried to step on the stride rake once I read about cache mechanics in Optimizing C++ by Agner Fog.

    According to this book your second assumption is wrong, because a memory address always maps to one specific set of cache lines. So bytes that map to the same set can only be cached by that set's lines, i.e. in its different "ways".
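
    As a quick sketch of that mapping: the 32 kb / 8-way / 64-byte-line geometry below is an assumption about this CPU's L1 (not queried from the hardware), giving 64 sets and a critical stride of 64 sets * 64 bytes = 4 kb, so addresses that many bytes apart land in the same set.

    #include <cstdint>
    #include <cstdio>

    int main() {
      // Assumed L1 geometry: 32 kb, 8-way, 64-byte lines => 32768 / (8 * 64) = 64 sets.
      const uintptr_t lineSize = 64;
      const uintptr_t numSets  = 64;

      static char buffer[32 * 1024];
      for (uintptr_t offset = 0; offset < sizeof(buffer); offset += 4096) {
        uintptr_t addr = reinterpret_cast<uintptr_t>(buffer) + offset;
        // set index = (address / line size) % number of sets
        std::printf("offset %2lu kb -> set %2lu\n",
                    static_cast<unsigned long>(offset >> 10),
                    static_cast<unsigned long>((addr / lineSize) % numSets));
      }
      return 0;
    }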

    My first attempt to reproduce this in user space failed. (My CPU is an i5-4200.)

    Total size 128kb cache set size 8kb => time 18ms; 568000000
    Total size 256kb cache set size 16kb => time 13ms; 120000000
    Total size 384kb cache set size 24kb => time 12ms; 688000000
    Total size 512kb cache set size 32kb => time 14ms; 240000000
    

    $ g++ -std=c++11 -march=native -O3 hit-stride.cpp -o hit-stride

    #include<iostream>
    #include<chrono>
    
    using namespace std::chrono;
    using namespace std;
    
    int main(int argc, char** argv) {
      unsigned int cacheSetSizes[] = { 8, 16, 24, 32 };
      const int ways = 8;
    
      for (unsigned int i = 0; i < sizeof(cacheSetSizes) / sizeof(int); ++i) {
        const unsigned int setSize = cacheSetSizes[i] * 1024;
        const unsigned int size = setSize * ways * 2;
        char* buffer = new char[size];
        for (int k = 0; k < size; ++k) {
          buffer[k] = k % 127;
        }
        const auto started = steady_clock::now();
        int sum = 0;
        for (int j = 0; j < 1000000; ++j) {
          for (int k = 0; k < size; k += setSize) {
            sum += buffer[k];
          }
        }
        const auto ended = steady_clock::now();
        cout << "Total size " << (size >> 10) << "kb cache set size " << cacheSetSizes[i]
             << "kb => time " << duration_cast<milliseconds>(ended - started).count()
             << "ms; " << sum << endl;
        delete[] buffer;  // buffer was allocated with new[]
      }
      return 0;
    }
    

    The "same" code wrapped into a kernel module looks like hits L2: I realized that I need to make memory physically contiguous. It's only possible to do in the kernel mode. My L1 cache size 32kb. In the test I walk over memory range longer that number of ways (8) with step equal to cache size. So I get noticeable slowdown on 32kb (last line).

    Apr 26 11:13:54 diehard kernel: [24992.943076] Memory 512 kb is allocated
    Apr 26 11:13:54 diehard kernel: [24992.969814] Duration  23524369 ns for cache set size         8 kb; sum = 568000000
    Apr 26 11:13:54 diehard kernel: [24992.990886] Duration  21076036 ns for cache set size        16 kb; sum = 120000000
    Apr 26 11:13:54 diehard kernel: [24993.013832] Duration  22950526 ns for cache set size        24 kb; sum = 688000000
    Apr 26 11:13:54 diehard kernel: [24993.045584] Duration  31760368 ns for cache set size        32 kb; sum = 240000000
    

    $ make && sudo insmod hello.ko && sleep 1 && tail -n 100 /var/log/syslog

    #include <linux/module.h>   /* Needed by all modules */
    #include <linux/kernel.h>   /* Needed for KERN_INFO */
    #include <linux/time.h>    
    
    static unsigned long p = 0;
    static struct timespec started, ended;
    static unsigned int cacheSetSizes[] = { 8, 16, 24, 32 };
    static const u32 ways = 8;
    static const u32 m = 2;
    static char* buffer;
    static unsigned int setSize;
    static unsigned int size;
    static unsigned int i, j, k;
    static int sum;
    
    int init_module(void) {
      s64 st, en, duration;
      u32 max = 1*1024*1024;
      printk(KERN_INFO "Hello world 1.\n");
      p = __get_free_pages(GFP_DMA, get_order(max));
      printk(KERN_INFO "Memory %u kb is allocated\n", ways * m * 32);
      buffer = (char*) p;
    
      for (k = 0; k < max; ++k) {
        buffer[k] = k % 127;
      }
    
      for (i = 0; i < sizeof(cacheSetSizes) / sizeof(int); ++i) {
        setSize = cacheSetSizes[i] * 1024;
        size = setSize * ways * m;
        if (size > max) {
          printk(KERN_INFO "size %u is more that %u", size, max);
          return 0;
        }
        getnstimeofday(&started);
        st = timespec_to_ns(&started);
    
        sum = 0;
        for (j = 0; j < 1000000; ++j) {
          for (k = 0; k < size; k += setSize) {
            sum += buffer[k];
          }
        }
    
        getnstimeofday(&ended);
        en = timespec_to_ns(&ended);
        duration = en - st;
        printk(KERN_INFO "Duration %9lld ns for cache set size %9u kb; sum = %9d\n",
               duration, cacheSetSizes[i], sum);
      }
      return 0;
    }
    
    void cleanup_module(void) {
      printk(KERN_INFO "Goodbye world 1.\n");
      free_pages(p, get_order(1*1024*1024));
      printk(KERN_INFO "Memory is free\n");
    }
    
  • 2021-02-02 16:17

    With regards to your expectation number 3, you are right: it behaves as you would expect. Please check "What Every Programmer Should Know About Memory" for more details; it's an excellent series of articles explaining the memory hierarchy.

    So why is it hard to confirm number 3? There are two main reasons: memory allocation, and virtual-to-physical address translation.

    Memory Allocation

    There is no strict guarantee about what the actual physical address of an allocated memory region will be. When you want to test CPU caches, I always recommend using posix_memalign to force the allocation to a specific boundary; otherwise you will probably see some weird behavior.
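
    A minimal sketch of such an aligned allocation, which also sets up the data arrays used in the experiment below; the names data, k, and N and the region sizes are illustrative assumptions, not a fixed API.

    #include <cstdlib>   // posix_memalign
    #include <cstring>   // memset
    #include <vector>

    constexpr size_t N = 10000000;   // ints per region (illustrative)
    constexpr size_t k = 16;         // number of regions (illustrative)

    std::vector<int*> allocate_regions() {
      std::vector<int*> data(k, nullptr);
      for (size_t i = 0; i < k; ++i) {
        void* p = nullptr;
        // force each region onto a 4096-byte page boundary
        if (posix_memalign(&p, 4096, N * sizeof(int)) != 0) {
          return {};   // allocation failed; handle this properly in real code
        }
        std::memset(p, 0, N * sizeof(int));   // touch the pages so they are really mapped
        data[i] = static_cast<int*>(p);
      }
      return data;
    }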

    Address Translation

    How address translation works is nicely explained in the article I mentioned. To verify your assumption, you have to try to pinpoint the expected behaviour. The easiest way to do this is as follows:

    Experiment

    Allocate a set of k large memory regions (something like 512 MB each) in the form of int arrays, and align them all to the 4096-byte page boundary. Now iterate over all elements of the regions, incrementally adding more of the k regions to your experiment. Measure the time and normalize it by the number of elements read.

    The code could look like:

    #define N 10000000
    // data[0..k-1] are the page-aligned int arrays described above (needs <ctime>/<cstdio>)
    for (size_t i = 1; i <= k; ++i) {      // i = number of regions included in this round

       size_t sum = 0;
       clock_t t1 = clock();
       for (size_t j = 0; j < N; ++j) {
           for (size_t u = 0; u < i; ++u) {
               sum += data[u][j];          // same index j in every region -> same cache set
           }
       }
       clock_t t2 = clock();

       // normalize by the number of elements read so the rounds are comparable
       double secondsPerElement = double(t2 - t1) / CLOCKS_PER_SEC / double(i * N);
       printf("%zu region(s): %.2f ns/element (sum=%zu)\n", i, secondsPerElement * 1e9, sum);
    }
    

    So what will happen? All large memory regions are aligned to 4 k, so based on the previous assumption, the elements at the same index j in every region map into the same cache set. When the number of memory regions touched in the inner loop is larger than the associativity of the cache, every access will incur a cache miss and the average processing time per element will increase.

    Update

    How writes are handled depends on how the cache line is used and on the CPU. Modern CPUs apply the MESI protocol when handling writes to cache lines, to make sure that all parties have the same view of memory (cache coherency). Typically, before you can write to a cache line, it must be read, and it is later written back. Whether you notice the write-back or not depends on how you access the data; if you re-read the cache line, you will probably not notice a difference.

    However, while the programmer typically has no influence on how data is stored in the CPU caches, with writes there is a slight difference: it is possible to perform so-called streaming writes that do not pollute the cache but are instead written directly to memory. These writes are also called non-temporal writes.
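
    A minimal sketch of such a non-temporal write using the SSE2 intrinsics; the function name, buffer, and size are illustrative, and this assumes an x86 CPU with SSE2.

    #include <emmintrin.h>   // _mm_stream_si32, _mm_sfence

    void fill_without_polluting_cache(int* dst, size_t n) {
      for (size_t i = 0; i < n; ++i) {
        // MOVNTI: store with a non-temporal hint, bypassing the cache hierarchy
        _mm_stream_si32(dst + i, static_cast<int>(i));
      }
      _mm_sfence();   // make the streaming stores globally visible before continuing
    }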

  • 2021-02-02 16:19

    First of all, there's a small clarification that needs to be made: in most cases a write would still require you to fetch the line into the local cache, since lines are usually 64 bytes and your write might only modify a partial chunk of one; the merge is done in the cache. Even if you were to write the whole line in one go (which could in theory be possible in some cases), you would still need to wait for the access in order to receive ownership of the line before writing to it. This protocol is called RFO (Read For Ownership), and it can take quite long, especially if you have a multi-socket system or anything with a complicated memory hierarchy.

    That said, your 4th assumption may still be correct in some cases, since a load operation does require the data to be fetched before the program can advance, while a store can be buffered and written later when possible. However, the load will only stall the program if it is on some critical path (meaning that some other operation waits for its result), a behavior which your test program doesn't exercise. Since most modern CPUs offer out-of-order execution, the following independent instructions are free to go without waiting for the load to complete. In your program there's no inter-iteration dependency except for the simple index advance (which can run ahead easily), so you're basically not bottlenecked on memory latency but rather on memory throughput, which is a totally different thing. By the way, to add such a dependency you could emulate a linked-list traversal (see the pointer-chasing sketch after the snippet below), or even simpler: make sure the array is initialized to zero (and switch the writes to zeros only), and add the content of each read value to the index on each iteration (in addition to the increment). This creates a dependency without changing the addresses themselves. Alternatively, do something nasty like this (assuming the compiler isn't smart enough to drop it...):

        if (onlyWriteToCache)
        {
            buffer[index] = (char)(index % 255);
        }
        else
        {
            // read-modify-write, then make the next index depend on the loaded value;
            // the += / -= pair cancels out, so the access pattern itself is unchanged
            buffer[index] = (char)(buffer[index] % 255);
            index += buffer[index];
            index -= buffer[index];
        }
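
    For completeness, here is a minimal pointer-chasing sketch of the linked-list idea mentioned above; the array size, the chase function, and the cycle construction are illustrative assumptions, not part of the original test.

    #include <cstddef>
    #include <vector>
    #include <numeric>
    #include <algorithm>
    #include <random>

    // Every load's address depends on the previous load's result, so the chain
    // is bound by memory latency rather than throughput.
    size_t chase(size_t iterations) {
      const size_t n = 1 << 20;                       // illustrative working-set size
      std::vector<size_t> order(n), next(n);
      std::iota(order.begin(), order.end(), size_t{0});
      std::shuffle(order.begin(), order.end(), std::mt19937{42});
      for (size_t i = 0; i < n; ++i) {
        next[order[i]] = order[(i + 1) % n];          // link all elements into one big cycle
      }

      size_t index = order[0];
      for (size_t i = 0; i < iterations; ++i) {
        index = next[index];                          // dependent load chain
      }
      return index;                                   // keep the result so it isn't optimized away
    }

    int main() { return static_cast<int>(chase(10000000) & 1); }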
    

    Now, about the results: it seems that write and read+write behave the same when you're jumping by the critical step, as expected (since the read doesn't differ much from the RFO that the write would issue anyway). However, for the non-critical step the read+write operation is much slower. It's hard to tell without knowing the exact system, but this could happen because loads (reads) and stores (writes) are not performed at the same stage in the lifetime of an instruction; this means that between the load and the store that follows, you may already have evicted the line and need to fetch it a second time. I'm not too sure about that, but if you want to check, maybe you could add an sfence instruction between the iterations (although that would slow you down significantly).
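
    If you want to try that, a rough sketch of where such a fence could go, using the _mm_sfence intrinsic rather than raw assembly; buffer, size, and step are placeholders for the corresponding variables in your test.

    #include <xmmintrin.h>   // _mm_sfence

    for (int j = 0; j < 1000000; ++j) {
      for (int k = 0; k < size; k += step) {
        buffer[k] = (char)(buffer[k] % 255);   // read + write, as in the original test
        _mm_sfence();   // order the store before the next access; expect a large slowdown
      }
    }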

    One last note: when you are bandwidth limited, writing can slow you down quite a bit because of another requirement. When you write to memory, you fetch a line into the cache and modify it; modified lines eventually need to be written back to memory (although in reality there's a whole set of lower-level caches on the way), which requires resources and can clog up your machine. Try a read-only loop and see how it goes.
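
    A read-only variant of the loop could look roughly like this; again buffer, size, and step are placeholders, and printing the sum keeps the compiler from dropping the loop.

    int sum = 0;
    for (int j = 0; j < 1000000; ++j) {
      for (int k = 0; k < size; k += step) {
        sum += buffer[k];   // load only: no RFO and no dirty lines to write back
      }
    }
    printf("%d\n", sum);    // use the result so the loop is not optimized away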
