CPU cache critical stride test giving unexpected results based on access type

后端 未结 3 1312
野趣味
野趣味 2021-02-02 15:40

Inspired by this recent question on SO and the answers given, which made me feel very ignorant, I decided I\'d spend some time to learn more about CPU caching a

3条回答
  •  故里飘歌
    2021-02-02 16:17

    With regards to your expectation number 3, you are right. It is as you might expect. Please check "What every Programmer should know about memory" for more details. It's an excellent series of articles explaining the memory hierarchy.

    So why is it hard to confirm number 3: There are two main reasons. One is memory allocation and the other is virtual-physical address translation.

    Memory Allocation

    There is no strict guarantee what the actual physical address of an allocated memory region is. When you want to test CPU caches I always recommend using posix_memalign to force the allocation to a specific boundary. Otherwise you probably see some weird behavior.

    Address Translation

    The way how address translation works is nicely explained in the article I mentioned. And to verify your assumption you have to try to pinpoint the expected behaviour. The easiest way to do this is as follows:

    Experiment

    Allocate a set of k large memory regions (something like 512MB) in form of int arrays and align them all to the page boundary of 4096b. Now iterate over all elements in the memory region and incrementally add more regions of k to your experiment. Measure the time and normalize by the number of elements read.

    The code could look like:

    #define N 10000000
    for(size_t i=0; i < k; ++i) {
    
       size_t sum=0;
       clock_t t1= clock();
       for(size_t j=0; j < N; ++j) {
           for(size_t u=0; u

    So what will happen. All large memory regions are aligned to 4k and based on the previous assumption all elements of same row will map into the same cache set. When the number of projected memory regions in the loop is larger than the associativity of the cache, all access will incur a cache miss and the average processing time per element will increase.

    Update

    How writes are handled depends on how the cache line is used and the CPU. Modern CPUs apply the MESI protocol for handling writes to cache lines to make sure that all parties have the same view on the memory (cache coherency). Typically before you can write to a cache line the cache line must be read and then written back. If you recognize the write-back or not depends on how you access the data. If you re-read the cache line again, you will probably not notice a difference.

    However, while the programmer has typically no influence on how the data is stored in the CPU caches, with writing there is a slight difference. It is possible to perform so called streaming writes that do not pollute the cache but are rather written directly to memory. These writes are also called non-temporal writes.

提交回复
热议问题