C++ Cache performance odd behavior

天涯浪人 2021-02-04 19:44

I read an article (1.5 years old http://www.drdobbs.com/parallel/cache-friendly-code-solving-manycores-ne/240012736) which talks about cache performance and size of data. They

4 answers
  • 2021-02-04 20:15

    Some results (OSX, Sandy Bridge):

    GCC -O0

    Size=1 ITERATIONS=1073741824, TIME=2416.06 ms
    Size=2 ITERATIONS=536870912, TIME=1885.46 ms
    Size=4 ITERATIONS=268435456, TIME=1782.92 ms
    Size=16 ITERATIONS=67108864, TIME=2023.71 ms
    Size=32 ITERATIONS=33554432, TIME=2184.99 ms
    Size=64 ITERATIONS=16777216, TIME=2464.09 ms
    Size=256 ITERATIONS=4194304, TIME=2358.31 ms
    Size=1024 ITERATIONS=1048576, TIME=2333.77 ms
    Size=2048 ITERATIONS=524288, TIME=2340.16 ms
    Size=4096 ITERATIONS=262144, TIME=2349.97 ms
    Size=8192 ITERATIONS=131072, TIME=2346.96 ms
    Size=16384 ITERATIONS=65536, TIME=2350.3 ms
    Size=32768 ITERATIONS=32768, TIME=2348.71 ms
    Size=65536 ITERATIONS=16384, TIME=2355.28 ms
    Size=262144 ITERATIONS=4096, TIME=2358.97 ms
    Size=524288 ITERATIONS=2048, TIME=2476.46 ms
    Size=1048576 ITERATIONS=1024, TIME=2429.07 ms
    Size=2097152 ITERATIONS=512, TIME=2427.09 ms
    Size=4194304 ITERATIONS=256, TIME=2443.42 ms
    Size=8388608 ITERATIONS=128, TIME=2435.54 ms
    Size=33554432 ITERATIONS=32, TIME=2389.08 ms
    Size=134217728 ITERATIONS=8, TIME=2444.43 ms
    Size=536870912 ITERATIONS=2, TIME=2600.91 ms
    

    GCC -O3

    Size=1 ITERATIONS=1073741824, TIME=2197.12 ms
    Size=2 ITERATIONS=536870912, TIME=996.409 ms
    Size=4 ITERATIONS=268435456, TIME=606.252 ms
    Size=16 ITERATIONS=67108864, TIME=306.904 ms
    Size=32 ITERATIONS=33554432, TIME=897.692 ms
    Size=64 ITERATIONS=16777216, TIME=847.794 ms
    Size=256 ITERATIONS=4194304, TIME=802.136 ms
    Size=1024 ITERATIONS=1048576, TIME=761.971 ms
    Size=2048 ITERATIONS=524288, TIME=760.136 ms
    Size=4096 ITERATIONS=262144, TIME=759.149 ms
    Size=8192 ITERATIONS=131072, TIME=749.881 ms
    Size=16384 ITERATIONS=65536, TIME=756.672 ms
    Size=32768 ITERATIONS=32768, TIME=759.565 ms
    Size=65536 ITERATIONS=16384, TIME=754.81 ms
    Size=262144 ITERATIONS=4096, TIME=745.899 ms
    Size=524288 ITERATIONS=2048, TIME=749.527 ms
    Size=1048576 ITERATIONS=1024, TIME=758.009 ms
    Size=2097152 ITERATIONS=512, TIME=776.671 ms
    Size=4194304 ITERATIONS=256, TIME=778.963 ms
    Size=8388608 ITERATIONS=128, TIME=783.191 ms
    Size=33554432 ITERATIONS=32, TIME=770.603 ms
    Size=134217728 ITERATIONS=8, TIME=785.703 ms
    Size=536870912 ITERATIONS=2, TIME=911.875 ms
    

    (Note how much slower the first one is. I suspect a mis-speculation somewhere around store-to-load forwarding...)

    Interestingly, turning optimizations on and removing the volatile gives a somewhat nicer curve:

    Size=1 ITERATIONS=1073741824, TIME=0 ms
    Size=2 ITERATIONS=536870912, TIME=0 ms
    Size=4 ITERATIONS=268435456, TIME=0 ms
    Size=16 ITERATIONS=67108864, TIME=0.001 ms
    Size=32 ITERATIONS=33554432, TIME=125.581 ms
    Size=64 ITERATIONS=16777216, TIME=140.654 ms
    Size=256 ITERATIONS=4194304, TIME=217.559 ms
    Size=1024 ITERATIONS=1048576, TIME=168.155 ms
    Size=2048 ITERATIONS=524288, TIME=159.031 ms
    Size=4096 ITERATIONS=262144, TIME=154.373 ms
    Size=8192 ITERATIONS=131072, TIME=153.858 ms
    Size=16384 ITERATIONS=65536, TIME=156.819 ms
    Size=32768 ITERATIONS=32768, TIME=156.505 ms
    Size=65536 ITERATIONS=16384, TIME=156.921 ms
    Size=262144 ITERATIONS=4096, TIME=215.911 ms
    Size=524288 ITERATIONS=2048, TIME=220.298 ms
    Size=1048576 ITERATIONS=1024, TIME=235.648 ms
    Size=2097152 ITERATIONS=512, TIME=320.284 ms
    Size=4194304 ITERATIONS=256, TIME=409.433 ms
    Size=8388608 ITERATIONS=128, TIME=431.743 ms
    Size=33554432 ITERATIONS=32, TIME=429.436 ms
    Size=134217728 ITERATIONS=8, TIME=430.052 ms
    Size=536870912 ITERATIONS=2, TIME=535.773 ms
    

    To help anyone reproduce the "issue", here is some standard (I hope) C++ code:

    #include <string>
    #include <iostream>
    #include <chrono>
    #include <cstdlib>
    #include <memory>
    
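    // SIZE is the number of ints touched per pass; ITERATIONS is the number of passes.
    template <unsigned int SIZE, unsigned int ITERATIONS>
    void test_body(volatile int *array) {
        // Unsigned counters match the unsigned template parameters.
        for (unsigned int i = 0; i < ITERATIONS; i++)
        {
            for (unsigned int x = 0; x < SIZE; x++)
            {
                array[x]++;  // volatile forces a real load and store every time
            }
        }
    }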
    
    
    template <unsigned int SIZE, unsigned int ITERATIONS>
    static void test_function()
    {
        static_assert(SIZE*ITERATIONS == 1024*1024*1024, "SIZE MISMATCH");
        // Value-initialize so the increments never read indeterminate values.
        std::unique_ptr<volatile int[]> array { new int[SIZE]() };
    
        // Warmup
        test_body<SIZE, 1>(array.get());
    
        auto start = std::chrono::steady_clock::now();
    
        test_body<SIZE, ITERATIONS>(array.get());
    
        auto end = std::chrono::steady_clock::now();
        auto diff = end - start;
        std::cout << "Size=" << SIZE << " ITERATIONS=" << ITERATIONS << ", TIME=" << std::chrono::duration <double, std::milli> (diff).count() << " ms" << std::endl;
    }
    
    int main()
    {
        enum { eIterations= 1024*1024*1024};
        test_function<1, eIterations>();
        test_function<2, eIterations/2>();
        test_function<4, eIterations/4>();
        test_function<16, eIterations/16>();
        test_function<32, eIterations/ 32>();
        test_function<64, eIterations/ 64>();
        test_function<256, eIterations/ 256>();
        test_function<1024, eIterations/ 1024>();
        test_function<2048, eIterations/ 2048>();
        test_function<4096, eIterations/ 4096>();
        test_function<8192, eIterations/ 8192>();
        test_function<16384, eIterations/ 16384>();
        test_function<32768, eIterations/ 32768>();
        test_function<65536, eIterations/ 65536>();
        test_function<262144, eIterations/ 262144>();
        test_function<524288, eIterations/ 524288>();
        test_function<1048576, eIterations/ 1048576>();
        test_function<2097152, eIterations/ 2097152>();
        test_function<4194304, eIterations/ 4194304>();
        test_function<8388608, eIterations/ 8388608>();
        test_function<33554432, eIterations/ 33554432>();
        test_function<134217728, eIterations/ 134217728>();
        test_function<536870912, eIterations/ 536870912>();
    }
    
  • 2021-02-04 20:15

    It seems clear that constant time implies a constant instruction execution rate. To measure cache/RAM speed, data transfer instructions should predominate, and the results need more than run time to interpret, such as MB/second and instructions per second. You need something like my BusSpeed benchmark (search for Roy BusSpeed benchmark or BusSpd2k for source code and results, with versions for Windows, Linux and Android). The original used assembly code with instructions like:

       "add     edx,ecx"     \
       "mov     ebx,[edi]"   \
       "mov     ecx,ebx"     \
    "lp: and     ebx,[edx]"   \
       "and     ecx,[edx+4]"   \
       "and     ebx,[edx+8]"   \
       "and     ecx,[edx+12]"   \
       "and     ebx,[edx+16]"   \
       "and     ecx,[edx+20]"   \
       "and     ebx,[edx+24]"   \
       "and     ecx,[edx+28]"   \
       "and     ebx,[edx+32]"   \
       "and     ecx,[edx+36]"   \
       "and     ebx,[edx+40]"   \
    
    (continuing the same pattern up to)
    
       "and     ecx,[edx+236]"   \
       "and     ebx,[edx+240]"   \
       "and     ecx,[edx+244]"   \
       "and     ebx,[edx+248]"   \
       "and     ecx,[edx+252]"   \
       "add     edx,256"     \
       "dec     eax"         \
       "jnz     lp"          \
       "and     ebx,ecx"     \
       "mov     [edi],ebx"     \             
    

    Later versions used C, as follows:

    void inc1word()
    {
       int i, j;
    
       for(j=0; j<passes1; j++)
       {
           for (i=0; i<wordsToTest; i=i+64)
           {
               andsum1 = andsum1 & array[i   ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
                                 & array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
                                 & array[i+8 ] & array[i+9 ] & array[i+10] & array[i+11]
                                 & array[i+12] & array[i+13] & array[i+14] & array[i+15]
                                 & array[i+16] & array[i+17] & array[i+18] & array[i+19]
                                 & array[i+20] & array[i+21] & array[i+22] & array[i+23]
                                 & array[i+24] & array[i+25] & array[i+26] & array[i+27]
                                 & array[i+28] & array[i+29] & array[i+30] & array[i+31]
                                 & array[i+32] & array[i+33] & array[i+34] & array[i+35]
                                 & array[i+36] & array[i+37] & array[i+38] & array[i+39]
                                 & array[i+40] & array[i+41] & array[i+42] & array[i+43]
                                 & array[i+44] & array[i+45] & array[i+46] & array[i+47]
                                 & array[i+48] & array[i+49] & array[i+50] & array[i+51]
                                 & array[i+52] & array[i+53] & array[i+54] & array[i+55]
                                 & array[i+56] & array[i+57] & array[i+58] & array[i+59]
                                 & array[i+60] & array[i+61] & array[i+62] & array[i+63];
           }
       }
    }
    

    The benchmark measures the MB/second of caches and RAM, including sequential addressing with skipped words (larger increments) to see where data is read in bursts. Example results follow. Note the burst reading effects, and that reading to two different registers (Reg2, from the assembly code version) can be faster than reading to one. In this case, loading every word to one register (AndI, Reg1, Inc4 bytes) produces almost constant speeds (around 1400 MIPS), so even a long sequence of instructions might not suit particular pipelines. The way to find out is to run a wider variation of your tests.

    Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz, measured 2807 MHz

             Windows Bus Speed Test Version 2.2 by Roy Longbottom
    
      Minimum      0.100 seconds per test, Start Fri Jul 30 16:43:56 2010
    
              MovI  MovI  MovI  MovI  MovI  MovI  AndI  AndI  MovM  MovM
      Memory  Reg2  Reg2  Reg2  Reg2  Reg1  Reg2  Reg1  Reg2  Reg1  Reg8
      KBytes Inc64 Inc32 Inc16  Inc8  Inc4  Inc4  Inc4  Inc4  Inc8  Inc8
       Used   MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S
    
          4  10025 10800 11262 11498 11612 11634  5850 11635 23093 23090
          8  10807 11267 11505 11627 11694 11694  5871 11694 23299 23297
         16  11251 11488 11620 11614 11712 11719  5873 11718 23391 23398
         32   9893  9853 10890 11170 11558 11492  5872 11466 21032 21025
         64   3219  4620  7289  9479 10805 10805  5875 10797 14426 14426
        128   3213  4805  7305  9467 10811 10810  5875 10805 14442 14408
        256   3144  4592  7231  9445 10759 10733  5870 10743 14336 14337
        512   2005  3497  5980  9056 10466 10467  5871 10441 13906 13905
       1024   2003  3482  5974  9017 10468 10466  5874 10467 13896 13818
       2048   2004  3497  5958  9088 10447 10448  5870 10447 13857 13857
       4096   1963  3398  5778  8870 10328 10328  5851 10328 13591 13630
       8192   1729  3045  5322  8270  9977  9963  5728  9965 12923 12892
      16384    692  1402  2495  4593  7811  7782  5406  7848  8335  8337
      32768    695  1406  2492  4584  7820  7826  5401  7792  8317  8322
      65536    695  1414  2488  4584  7823  7826  5403  7800  8321  8321
     131072    696  1402  2491  4575  7827  7824  5411  7846  8322  8323
     262144    696  1413  2498  4594  7791  7826  5409  7829  8333  8334
     524288    693  1416  2498  4595  7841  7842  5411  7847  8319  8285
    1048576    704  1415  2478  4591  7845  7840  5410  7853  8290  8283
    
                      End of test Fri Jul 30 16:44:29 2010
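
    As a rough illustration of how MB/second figures like those in the table can be produced, here is a minimal C++ sketch. It is not the BusSpeed source: the function names are made up, 1 MB is taken as 1,000,000 bytes, and only the words actually loaded are counted, all of which are assumptions on my part.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert a byte count and an elapsed time into MB/second
// (assumption: 1 MB = 1,000,000 bytes).
double mb_per_second(std::size_t bytes, double seconds) {
    return static_cast<double>(bytes) / 1e6 / seconds;
}

// AND together every `stride`-th 32-bit word of the buffer, `passes` times.
// Returning the accumulated value keeps the compiler from discarding the loads.
std::uint32_t and_words(const std::vector<std::uint32_t>& data,
                        std::size_t stride, std::size_t passes) {
    std::uint32_t acc = 0xFFFFFFFFu;
    for (std::size_t p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < data.size(); i += stride)
            acc &= data[i];
    return acc;
}

// Time and_words() and report throughput, counting only the words actually read.
double measure_mb_per_second(const std::vector<std::uint32_t>& data,
                             std::size_t stride, std::size_t passes,
                             std::uint32_t& sink) {
    auto start = std::chrono::steady_clock::now();
    sink = and_words(data, stride, passes);
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::size_t words_read = ((data.size() + stride - 1) / stride) * passes;
    return mb_per_second(words_read * sizeof(std::uint32_t), elapsed.count());
}
```

    Calling measure_mb_per_second with stride 1 corresponds roughly to the Inc4 columns and stride 16 to Inc64, for buffer sizes from a few KB up to well beyond the last-level cache.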
    

    The MovM columns use 1 and 8 MMX registers; later versions use SSE.

    Source code and executables are free for anyone to play with. The files are at the following links, shown with their array declarations:

    Windows http://www.roylongbottom.org.uk/busspd2k.zip

     xx = (int *)VirtualAlloc(NULL, useMemK*1024+256, MEM_COMMIT, PAGE_READWRITE);
    

    Linux http://www.roylongbottom.org.uk/memory_benchmarks.tar.gz

    #ifdef Bits64
       array = (long long *)_mm_malloc(memoryKBytes[ipass-1]*1024, 16);
    #else
       array = (int *)_mm_malloc(memoryKBytes[ipass-1]*1024, 16);
    #endif
    

    Results and other links (MP version, Android) are in:

    http://www.roylongbottom.org.uk/busspd2k%20results.htm

  • 2021-02-04 20:34

    I don't get constant time. I modified your code a bit to make it simpler. My times are much lower than yours; I'm not sure why. The large times at the beginning make sense: there are only a few values to write to, so the increments form a dependency chain. The L2 cache ends at 256 KB / 4 bytes per int = 64k elements. Notice how the times start rising between Size=32768 and Size=65536.

    //GCC -O3 Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz
    Size=1, Iterations=1073741824, Time=187.18 ms
    Size=2, Iterations=536870912, Time=113.47 ms
    Size=4, Iterations=268435456, Time=50.53 ms
    Size=8, Iterations=134217728, Time=25.02 ms
    Size=16, Iterations=67108864, Time=25.61 ms
    Size=32, Iterations=33554432, Time=24.08 ms
    Size=64, Iterations=16777216, Time=22.69 ms
    Size=128, Iterations=8388608, Time=22.03 ms
    Size=256, Iterations=4194304, Time=19.98 ms
    Size=512, Iterations=2097152, Time=17.09 ms
    Size=1024, Iterations=1048576, Time=15.66 ms
    Size=2048, Iterations=524288, Time=14.94 ms
    Size=4096, Iterations=262144, Time=14.58 ms
    Size=8192, Iterations=131072, Time=14.40 ms
    Size=16384, Iterations=65536, Time=14.63 ms
    Size=32768, Iterations=32768, Time=14.75 ms
    Size=65536, Iterations=16384, Time=18.58 ms
    Size=131072, Iterations=8192, Time=20.51 ms
    Size=262144, Iterations=4096, Time=21.18 ms
    Size=524288, Iterations=2048, Time=21.26 ms
    Size=1048576, Iterations=1024, Time=21.22 ms
    Size=2097152, Iterations=512, Time=22.17 ms
    Size=4194304, Iterations=256, Time=38.01 ms
    Size=8388608, Iterations=128, Time=38.63 ms
    Size=16777216, Iterations=64, Time=38.09 ms
    Size=33554432, Iterations=32, Time=38.54 ms
    Size=67108864, Iterations=16, Time=39.11 ms
    Size=134217728, Iterations=8, Time=39.96 ms
    Size=268435456, Iterations=4, Time=42.15 ms
    Size=536870912, Iterations=2, Time=46.39 ms
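
    The 256k/4 = 64k arithmetic above can be generalised to tell which cache level an array nominally fits in. This is only a sketch: the 256 KB L2 size comes from the answer, while the 32 KB L1 and 10 MB L3 sizes, and all the names, are assumptions for illustration.

```cpp
#include <cstddef>

enum class Level { L1, L2, L3, RAM };

// Cache capacities in bytes. L2 = 256 KB is the figure used in the answer;
// the L1 and L3 values are assumptions made for this sketch.
constexpr std::size_t kL1Bytes = 32u * 1024;
constexpr std::size_t kL2Bytes = 256u * 1024;
constexpr std::size_t kL3Bytes = 10u * 1024 * 1024;

// Number of elements of `elem_size` bytes that fit in `cache_bytes` bytes.
constexpr std::size_t elements_fitting(std::size_t cache_bytes, std::size_t elem_size) {
    return cache_bytes / elem_size;
}

// Smallest level whose nominal capacity holds `elements` items of `elem_size` bytes.
constexpr Level smallest_level_holding(std::size_t elements, std::size_t elem_size) {
    return elements * elem_size <= kL1Bytes ? Level::L1
         : elements * elem_size <= kL2Bytes ? Level::L2
         : elements * elem_size <= kL3Bytes ? Level::L3
         : Level::RAM;
}
```

    With these assumed sizes, 2097152 ints (8 MB) still fit in L3 and 4194304 ints (16 MB) do not, which is where the times listed above step from about 22 ms to about 38 ms.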
    

    The code:

    #include <stdio.h>
    #include <omp.h>
    
    static void test_function(int n, int iterations)
    {
        int *array = new int[n]();  // value-initialize: incrementing uninitialized ints is undefined behavior
        for (int i = 0; i < iterations; i++)
            for (int x = 0; x < n; x++)
              array[x]++;
        delete[] array;
    }
    
    int main() {        
        for(int i=0, n=1, iterations=1073741824; i<30; i++, n*=2, iterations/=2) {
            double dtime;
            dtime = omp_get_wtime();
            test_function(n, iterations);
            dtime = omp_get_wtime() - dtime;
            printf("Size=%d, Iterations=%d, Time=%.2f ms\n", n, iterations, dtime * 1000.0);  // omp_get_wtime() returns seconds
        }
    }
    
  • 2021-02-04 20:36

    TL;DR: Your test is not a correct test of cache latency or speed. Instead it measures the problems of pushing complex code through an out-of-order (OoO) CPU pipeline.

    Use the right tests for measuring cache and memory latency (lat_mem_rd from lmbench) and the right tests for speed (bandwidth) measurements: the STREAM benchmark for memory speed, and the tests from memtest86, whose main operation is rep movsl, for cache speed.

    Also, modern (2010 and newer) desktop/server CPUs have hardware prefetch logic built in near the L1 and L2 caches. It detects linear access patterns and preloads data from the outer caches into the inner ones before you ask for it: Intel Optimization Manual - 7.2 Hardware prefetching of data, page 365; intel.com blog, 2009. It is hard to disable all hardware prefetchers (SO Q/A 1, SO Q/A 2).
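
    To show what a latency test that sidesteps linear prefetching can look like, here is a minimal pointer-chasing sketch in C++. It is not the lmbench implementation of lat_mem_rd, only the general idea: every load depends on the previous one, and the order is a random single cycle (built with Sattolo's algorithm) rather than a linear walk. All function names are hypothetical.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build next[] so that following i -> next[i] visits every index exactly once
// before returning to the start (Sattolo's algorithm: a single-cycle permutation).
std::vector<std::size_t> make_single_cycle(std::size_t n, unsigned seed) {
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937 rng(seed);
    for (std::size_t i = n; i-- > 1;) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Follow the chain for `steps` hops starting at index 0. Each load needs the
// result of the previous one, so the loads cannot overlap.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t i = 0;
    for (std::size_t s = 0; s < steps; ++s)
        i = next[i];
    return i;
}

// Average nanoseconds per dependent load; `sink` receives the final index so
// the compiler cannot drop the chase.
double ns_per_load(const std::vector<std::size_t>& next, std::size_t steps,
                   std::size_t& sink) {
    auto start = std::chrono::steady_clock::now();
    sink = chase(next, steps);
    std::chrono::duration<double, std::nano> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / static_cast<double>(steps);
}
```

    Running ns_per_load for array sizes from a few KB up to hundreds of MB should show steps at the cache boundaries, which the linear increment loop hides.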

    Long story:

    I will try several measurements of a similar test with the perf performance monitoring tool on Linux (aka perf_events). The code is based on the program from Joky (an array of 32-bit ints, not chars) and was split into several binaries: a5 is for size 2^5 = 32; a10 for 2^10 = 1024 (4 KB); a15 for 2^15 = 32768 (128 KB); a20 for 2^20 (1 million ints = 4 MB); and a25 for 2^25 (32 million ints = 128 MB). The CPU is an i7-2600, a quad-core Sandy Bridge at 3.4 GHz.

    Let's start with a basic perf stat run using the default event set (some lines are skipped). I selected 2^10 (4 KB) and 2^20 (4 MB):

    $ perf stat ./a10
    Size=1024 ITERATIONS=1048576, TIME=2372.09 ms
    
     Performance counter stats for './a10':
    
                   276 page-faults               #    0,000 M/sec
         8 238 473 169 cycles                    #    3,499 GHz
         4 936 244 310 stalled-cycles-frontend   #   59,92% frontend cycles idle
           415 849 629 stalled-cycles-backend    #    5,05% backend  cycles idle
        11 832 421 238 instructions              #    1,44  insns per cycle
                                                 #    0,42  stalled cycles per insn
         1 078 974 782 branches                  #  458,274 M/sec
             1 080 091 branch-misses             #    0,10% of all branches
    
    $ perf stat ./a20
    Size=1048576 ITERATIONS=1024, TIME=2432.4 ms
    
     Performance counter stats for './a20':
    
                 2 321 page-faults               #    0,001 M/sec
         8 487 656 735 cycles                    #    3,499 GHz
         5 184 295 720 stalled-cycles-frontend   #   61,08% frontend cycles idle
           663 245 253 stalled-cycles-backend    #    7,81% backend  cycles idle
        11 836 712 988 instructions              #    1,39  insns per cycle
                                                 #    0,44  stalled cycles per insn
         1 077 257 745 branches                  #  444,104 M/sec
                30 601 branch-misses             #    0,00% of all branches
    

    What can we see here? Instruction counts are very close (because Size*Iterations is constant), and cycle count and time are close too. Both runs have about 1 billion branches, more than 99% of them predicted correctly. But the stall count is very high: about 60% for the frontend and 5-8% for the backend. Frontend stalls are stalls in instruction fetch and decode. It can be hard to tell why they occur, but for your code the frontend can't decode 4 instructions per tick (page B-41 of the Intel optimisation manual, section B.3 - "Performance tuning techniques for ... Sandy Bridge", subsection B.3.2 Hierarchical Top-Down Performance Characterization ...).

    $ perf record -e stalled-cycles-frontend ./a20
    Size=1048576 ITERATIONS=1024, TIME=2477.65 ms
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.097 MB perf.data (~4245 samples) ]
    $ perf annotate -d a20|cat
     Percent |      Source code & Disassembly of a20
    ------------------------------------------------
             :      08048e6f <void test_body<1048576u, 1024u>(int volatile*)>:
    
       10.43 :       8048e87:       mov    -0x8(%ebp),%eax
        1.10 :       8048e8a:       lea    0x0(,%eax,4),%edx
        0.16 :       8048e91:       mov    0x8(%ebp),%eax
        0.78 :       8048e94:       add    %edx,%eax
        6.87 :       8048e96:       mov    (%eax),%edx
       52.53 :       8048e98:       add    $0x1,%edx
        9.89 :       8048e9b:       mov    %edx,(%eax)
       14.15 :       8048e9d:       addl   $0x1,-0x8(%ebp)
        2.66 :       8048ea1:       mov    -0x8(%ebp),%eax
        1.39 :       8048ea4:       cmp    $0xfffff,%eax
    

    Or here with raw opcodes (objdump -d). Some instructions use rather complicated indexing, so possibly they can't be handled by the 3 simple decoders and have to wait for the single complex decoder (there is a diagram at http://www.realworldtech.com/sandy-bridge/4/):

     8048e87:       8b 45 f8                mov    -0x8(%ebp),%eax
     8048e8a:       8d 14 85 00 00 00 00    lea    0x0(,%eax,4),%edx
     8048e91:       8b 45 08                mov    0x8(%ebp),%eax
     8048e94:       01 d0                   add    %edx,%eax
     8048e96:       8b 10                   mov    (%eax),%edx
     8048e98:       83 c2 01                add    $0x1,%edx
     8048e9b:       89 10                   mov    %edx,(%eax)
     8048e9d:       83 45 f8 01             addl   $0x1,-0x8(%ebp)
     8048ea1:       8b 45 f8                mov    -0x8(%ebp),%eax
     8048ea4:       3d ff ff 0f 00          cmp    $0xfffff,%eax
    

    Backend stalls are stalls created by waiting for memory or cache (the thing you are interested in when measuring caches) and by internal execution core stalls:

    $ perf record -e stalled-cycles-backend ./a20
    Size=1048576 ITERATIONS=1024, TIME=2480.09 ms
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.095 MB perf.data (~4149 samples) ]
    $ perf annotate -d a20|cat
        4.25 :       8048e96:       mov    (%eax),%edx
       58.68 :       8048e98:       add    $0x1,%edx
        8.86 :       8048e9b:       mov    %edx,(%eax)
        3.94 :       8048e9d:       addl   $0x1,-0x8(%ebp)
        7.66 :       8048ea1:       mov    -0x8(%ebp),%eax
        7.40 :       8048ea4:       cmp    $0xfffff,%eax
    

    Most backend stalls are reported for add $0x1,%edx because it is the consumer of the data loaded from the array by the previous instruction. Together with the store to the array they account for about 70% of backend stalls; multiplied by the backend stall share of the whole program (7%), that is about 5% of all cycles. In other words, the cache is faster than your program. Now we can answer your first question:

    Why the time taken does not increase at all regardless of the size of my array?

    Your test is so unoptimized that, although you are trying to measure the caches, they account for only about a 5% slowdown of the total run time. Your test is also so unstable (noisy) that you will not see this 5% effect.

    With custom perf stat runs we can also measure the cache request-to-miss ratio. For the 4 KB program, the L1 data cache serves 99.99% of all loads and 99.999% of all stores. Note that your test generates several times more cache requests than are needed to walk the array and increment every element (1 billion loads + 1 billion stores). The additional accesses are for local variables such as x; they are always served by the cache because their addresses are constant.

    $ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a10
    Size=1024 ITERATIONS=1048576, TIME=2412.25 ms
    
     Performance counter stats for './a10':
    
         5 375 195 765 L1-dcache-loads
               364 140 L1-dcache-load-misses     #    0,01% of all L1-dcache hits
         2 151 408 053 L1-dcache-stores
                13 350 L1-dcache-store-misses
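
    A rough cross-check of the counter values above, assuming my reading of the unoptimized loop in the earlier perf annotate listing is right: each inner iteration performs five loads (x is read twice by mov -0x8(%ebp),%eax and once by addl, plus the array pointer and the array element) and two stores (the array element and x). The constant and function names are made up for this sketch.

```cpp
#include <cstdint>

// SIZE * ITERATIONS is fixed at 1024*1024*1024 inner iterations.
constexpr std::uint64_t kInnerIterations = 1ULL << 30;

// Memory operations per inner iteration, counted by hand from the
// disassembly (an assumption of this sketch, not a measured value).
constexpr std::uint64_t kLoadsPerIteration = 5;
constexpr std::uint64_t kStoresPerIteration = 2;

constexpr std::uint64_t expected_loads() { return kLoadsPerIteration * kInnerIterations; }
constexpr std::uint64_t expected_stores() { return kStoresPerIteration * kInnerIterations; }

// Relative difference between a measured counter and the expected value.
double relative_difference(std::uint64_t measured, std::uint64_t expected) {
    double m = static_cast<double>(measured);
    double e = static_cast<double>(expected);
    return (m > e ? m - e : e - m) / e;
}
```

    The expected values are 5 368 709 120 loads and 2 147 483 648 stores, within a fraction of a percent of the measured 5 375 195 765 and 2 151 408 053. Only one load and one store per iteration touch the array; the rest are stack traffic.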
    

    For the 4 MB program the hit rate is many times worse: more than 100 times as many misses! Now about 1.2% of all memory requests are served not by L1 but by L2.

    $ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a20
    Size=1048576 ITERATIONS=1024, TIME=2443.92 ms
    
     Performance counter stats for './a20':
    
         5 378 035 007 L1-dcache-loads
            67 725 008 L1-dcache-load-misses     #    1,26% of all L1-dcache hits
         2 152 183 588 L1-dcache-stores
            67 266 426 L1-dcache-store-misses
    

    Can we really expect to notice cache latency going from 4 CPU ticks up to 12 (3 times longer) when this change affects only 1.2% of cache requests, and when only 7% of our program's run time is sensitive to cache latency at all?

    What if we use a bigger data set? OK, here is a25 (2^25 4-byte ints = 128 MB, several times more than the cache size):

    $ perf stat   ./a25
    Size=134217728 ITERATIONS=8, TIME=2437.25 ms
    
     Performance counter stats for './a25':
    
               262 417 page-faults               #    0,090 M/sec
        10 214 588 827 cycles                    #    3,499 GHz
         6 272 114 853 stalled-cycles-frontend   #   61,40% frontend cycles idle
         1 098 632 880 stalled-cycles-backend    #   10,76% backend  cycles idle
        13 683 671 982 instructions              #    1,34  insns per cycle
                                                 #    0,46  stalled cycles per insn
         1 274 410 549 branches                  #  436,519 M/sec
               315 656 branch-misses             #    0,02% of all branches
    
    $ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a25
    Size=134217728 ITERATIONS=8, TIME=2444.13 ms
    
     Performance counter stats for './a25':
    
         6 138 410 226 L1-dcache-loads
            77 025 747 L1-dcache-load-misses     #    1,25% of all L1-dcache hits
         2 515 141 824 L1-dcache-stores
            76 320 695 L1-dcache-store-misses
    

    The L1 miss rate is almost the same, and there are more backend stalls. I was able to get stats on the "cache-references,cache-misses" events, and I suspect they refer to the L3 cache (there are several times more requests to L2):

    $ perf stat -e 'cache-references,cache-misses' ./a25
    Size=134217728 ITERATIONS=8, TIME=2440.71 ms
    
     Performance counter stats for './a25':
    
            17 053 482 cache-references
            11 829 118 cache-misses              #   69,365 % of all cache refs
    

    So the miss rate is high, but the test does 1 billion (useful) loads, and only 0.08 billion of them miss L1. About 0.01 billion requests are served by memory. Memory latency is around 230 CPU clocks instead of the 4-clock L1 latency. Is the test able to see this? Maybe, if the noise is low.
