C++ Cache performance odd behavior

天涯浪人 2021-02-04 19:44

I read an article (1.5 years old http://www.drdobbs.com/parallel/cache-friendly-code-solving-manycores-ne/240012736) which talks about cache performance and size of data. They

4 answers
  •  温柔的废话
    2021-02-04 20:15

    It seems clear that the constant time implies a constant instruction execution rate. To measure cache or RAM speed, data transfer instructions should dominate the test, and the results need to be expressed as rates such as MB/second and instructions per second rather than as run time alone. You need something like my BusSpeed benchmark (search for Roy BusSpeed benchmark, or BusSpd2k for source code and results, with versions for Windows, Linux and Android). The original used assembly code with instructions like:

       "add     edx,ecx"     \
       "mov     ebx,[edi]"   \
       "mov     ecx,ebx"     \
    "lp: and     ebx,[edx]"   \
       "and     ecx,[edx+4]"   \
       "and     ebx,[edx+8]"   \
       "and     ecx,[edx+12]"   \
       "and     ebx,[edx+16]"   \
       "and     ecx,[edx+20]"   \
       "and     ebx,[edx+24]"   \
       "and     ecx,[edx+28]"   \
       "and     ebx,[edx+32]"   \
       "and     ecx,[edx+36]"   \
       "and     ebx,[edx+40]"   \
    
    and so on, up to
    
       "and     ecx,[edx+236]"   \
       "and     ebx,[edx+240]"   \
       "and     ecx,[edx+244]"   \
       "and     ebx,[edx+248]"   \
       "and     ecx,[edx+252]"   \
       "add     edx,256"     \
       "dec     eax"         \
       "jnz     lp"          \
       "and     ebx,ecx"     \
       "mov     [edi],ebx"     \             
    

    Later versions used C, as follows:

    void inc1word()
    {
       int i, j;
    
       for(j=0; j

    The benchmark measures MB/second for caches and RAM, including strided (skipped sequential) addressing to show where data is read in bursts. Example results follow. Note the burst-reading effects, and that reading into two different registers (Reg2, from the assembly code version) can be faster than reading into one. In this case, loading every word into one register (AndI, Reg1, Inc4 bytes) produces almost constant speeds (around 1400 MIPS) at every memory size, so even a long sequence of instructions might not suit a particular pipeline. The way to find out is to run a wider variation of your tests.

    Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz Measured 2807 MHz

             Windows Bus Speed Test Version 2.2 by Roy Longbottom
    
      Minimum      0.100 seconds per test, Start Fri Jul 30 16:43:56 2010
    
              MovI  MovI  MovI  MovI  MovI  MovI  AndI  AndI  MovM  MovM
      Memory  Reg2  Reg2  Reg2  Reg2  Reg1  Reg2  Reg1  Reg2  Reg1  Reg8
      KBytes Inc64 Inc32 Inc16  Inc8  Inc4  Inc4  Inc4  Inc4  Inc8  Inc8
       Used   MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S
    
          4  10025 10800 11262 11498 11612 11634  5850 11635 23093 23090
          8  10807 11267 11505 11627 11694 11694  5871 11694 23299 23297
         16  11251 11488 11620 11614 11712 11719  5873 11718 23391 23398
         32   9893  9853 10890 11170 11558 11492  5872 11466 21032 21025
         64   3219  4620  7289  9479 10805 10805  5875 10797 14426 14426
        128   3213  4805  7305  9467 10811 10810  5875 10805 14442 14408
        256   3144  4592  7231  9445 10759 10733  5870 10743 14336 14337
        512   2005  3497  5980  9056 10466 10467  5871 10441 13906 13905
       1024   2003  3482  5974  9017 10468 10466  5874 10467 13896 13818
       2048   2004  3497  5958  9088 10447 10448  5870 10447 13857 13857
       4096   1963  3398  5778  8870 10328 10328  5851 10328 13591 13630
       8192   1729  3045  5322  8270  9977  9963  5728  9965 12923 12892
      16384    692  1402  2495  4593  7811  7782  5406  7848  8335  8337
      32768    695  1406  2492  4584  7820  7826  5401  7792  8317  8322
      65536    695  1414  2488  4584  7823  7826  5403  7800  8321  8321
     131072    696  1402  2491  4575  7827  7824  5411  7846  8322  8323
     262144    696  1413  2498  4594  7791  7826  5409  7829  8333  8334
     524288    693  1416  2498  4595  7841  7842  5411  7847  8319  8285
    1048576    704  1415  2478  4591  7845  7840  5410  7853  8290  8283
    
                      End of test Fri Jul 30 16:44:29 2010
    

    The MovM columns use 1 and 8 MMX registers; later versions use SSE.
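
    The strided measurement described above can be sketched in portable C++. This is a minimal illustration, not the BusSpeed source: the names `and_strided`, `measure`, `Result` and `mips_from_mb_per_sec` are hypothetical, and it assumes that 1 MB means 10^6 bytes and that only the bytes actually loaded are counted (4 per load), neither of which is confirmed to match BusSpeed. An optimizing compiler may vectorize or otherwise transform the loop, and the timer is read once per pass, so the numbers are indicative only.

    ```cpp
    #include <cassert>
    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical helper (not BusSpeed source): ANDs every stride_words-th
    // 32-bit word into two alternating accumulators, like the Reg2 pattern.
    std::uint32_t and_strided(const std::vector<std::uint32_t>& data,
                              std::size_t stride_words)
    {
        assert(stride_words > 0);
        std::uint32_t a = 0xFFFFFFFFu, b = 0xFFFFFFFFu;
        std::size_t i = 0;
        for (; i + stride_words < data.size(); i += 2 * stride_words) {
            a &= data[i];
            b &= data[i + stride_words];
        }
        if (i < data.size()) a &= data[i];  // one strided word left over
        return a & b;
    }

    struct Result {
        double mb_per_sec;       // bytes actually loaded per second / 1e6 (assumption)
        std::uint32_t checksum;  // returned so the loads have an observable result
    };

    // Repeats full passes over `kbytes` KB of data until at least
    // `min_seconds` have elapsed, then reports the achieved load rate.
    Result measure(std::size_t kbytes, std::size_t stride_bytes,
                   double min_seconds = 0.1)
    {
        assert(kbytes > 0 && min_seconds > 0.0);
        assert(stride_bytes >= 4 && stride_bytes % 4 == 0);
        std::vector<std::uint32_t> data(kbytes * 1024 / 4, 0xFFFFFFFFu);
        const std::size_t stride_words = stride_bytes / 4;
        const std::size_t loads = (data.size() + stride_words - 1) / stride_words;

        using clock = std::chrono::steady_clock;
        std::uint32_t checksum = 0xFFFFFFFFu;
        std::size_t passes = 0;
        double elapsed = 0.0;
        const clock::time_point start = clock::now();
        do {
            checksum &= and_strided(data, stride_words);
            ++passes;
            elapsed = std::chrono::duration<double>(clock::now() - start).count();
        } while (elapsed < min_seconds);

        const double bytes = static_cast<double>(passes) * loads * 4.0;
        return Result{bytes / elapsed / 1e6, checksum};
    }

    // Converts a load rate in MB/s into millions of load instructions per second.
    double mips_from_mb_per_sec(double mb_per_sec, double bytes_per_instruction)
    {
        return mb_per_sec / bytes_per_instruction;
    }
    ```

    Calling `measure(kb, 64)` and `measure(kb, 4)` for `kb` from 4 to 1048576 should give a table shaped like the one above, though absolute figures depend on the compiler and its optimization settings. As a check on the arithmetic, `mips_from_mb_per_sec(5870.0, 4.0)` gives 1467.5, which is in line with the roughly 1400 MIPS quoted for the AndI Reg1 Inc4 column.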

    Source code and executable files are free for anyone to play with. The files are at the following links, shown with the array declarations they use:

    Windows http://www.roylongbottom.org.uk/busspd2k.zip

     xx = (int *)VirtualAlloc(NULL, useMemK*1024+256, MEM_COMMIT, PAGE_READWRITE);
    

    Linux http://www.roylongbottom.org.uk/memory_benchmarks.tar.gz

    #ifdef Bits64
       array = (long long *)_mm_malloc(memoryKBytes[ipass-1]*1024, 16);
    #else
       array = (int *)_mm_malloc(memoryKBytes[ipass-1]*1024, 16);
    #endif
    

    Results and other links (MP version, Android) are at:

    http://www.roylongbottom.org.uk/busspd2k%20results.htm
