C++ Cache performance odd behavior


天涯浪人 2021-02-04 19:44

I read an article (1.5 years old http://www.drdobbs.com/parallel/cache-friendly-code-solving-manycores-ne/240012736) which talks about cache performance and size of data. They

4 answers

被撕碎了的回忆 (original poster)

2021-02-04 20:36

TL;DR: Your test is not a valid test of cache latency or bandwidth. What it mostly measures is how well an out-of-order CPU pipeline gets through unoptimized code.

Use the right tools instead. For cache and memory latency: lat_mem_rd from lmbench. For speed (bandwidth): the STREAM benchmark for memory, and the tests from memtest86 for cache (their main operation is rep movsl).

Also, modern (2010 and newer) desktop/server CPUs have hardware prefetch logic built in near the L1 and L2 caches. It detects linear access patterns and preloads data from the outer caches into the inner ones before you ask for it: Intel Optimization Manual - 7.2 Hardware prefetching of data, page 365; intel.com blog, 2009. It is hard to disable all hardware prefetchers (SO Q/A 1, SO Q/A 2).

Long story:

I will run several measurements of a similar test with the perf performance monitoring tool on Linux (aka perf_events). The code is based on the program from Joky (an array of 32-bit ints, not of chars) and was split into several binaries: a5 is for size 2^5 = 32; a10 => 2^10 = 1024 (4 KB); a15 => 2^15 = 32768; a20 (1 million ints = 4 MB); and a25 (32 million ints = 128 MB). The CPU is an i7-2600, a quad-core Sandy Bridge at 3.4 GHz.

Let's start with a basic perf stat run using the default event set (some lines are skipped). I selected 2^10 (4 KB) and 2^20 (4 MB).

$ perf stat ./a10
Size=1024 ITERATIONS=1048576, TIME=2372.09 ms

 Performance counter stats for './a10':

               276 page-faults               #    0,000 M/sec
     8 238 473 169 cycles                    #    3,499 GHz
     4 936 244 310 stalled-cycles-frontend   #   59,92% frontend cycles idle
       415 849 629 stalled-cycles-backend    #    5,05% backend  cycles idle
    11 832 421 238 instructions              #    1,44  insns per cycle
                                             #    0,42  stalled cycles per insn
     1 078 974 782 branches                  #  458,274 M/sec
         1 080 091 branch-misses             #    0,10% of all branches

$ perf stat ./a20
Size=1048576 ITERATIONS=1024, TIME=2432.4 ms

 Performance counter stats for './a20':

             2 321 page-faults               #    0,001 M/sec
     8 487 656 735 cycles                    #    3,499 GHz
     5 184 295 720 stalled-cycles-frontend   #   61,08% frontend cycles idle
       663 245 253 stalled-cycles-backend    #    7,81% backend  cycles idle
    11 836 712 988 instructions              #    1,39  insns per cycle
                                             #    0,44  stalled cycles per insn
     1 077 257 745 branches                  #  444,104 M/sec
            30 601 branch-misses             #    0,00% of all branches

What can we see here? Instruction counts are very close (because Size*Iterations is constant), and cycle counts and times are close too. Both runs execute about 1 billion branches, with more than 99.9% of them predicted correctly. But the frontend stall count is very high at about 60%, against 5-8% for the backend. Frontend stalls are stalls in instruction fetch and decode. It can be hard to tell why they happen, but for your code the frontend can't decode 4 instructions per tick (page B-41 of the Intel optimisation manual, section B.3 - "Performance tuning techniques for ... Sandy Bridge", subsection B.3.2 Hierarchical Top-Down Performance Characterization ...).

$ perf record -e stalled-cycles-frontend ./a20
Size=1048576 ITERATIONS=1024, TIME=2477.65 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.097 MB perf.data (~4245 samples) ]
$ perf annotate -d a20|cat
 Percent |      Source code & Disassembly of a20
------------------------------------------------
         :      08048e6f (int volatile*)>:

   10.43 :       8048e87:       mov    -0x8(%ebp),%eax
    1.10 :       8048e8a:       lea    0x0(,%eax,4),%edx
    0.16 :       8048e91:       mov    0x8(%ebp),%eax
    0.78 :       8048e94:       add    %edx,%eax
    6.87 :       8048e96:       mov    (%eax),%edx
   52.53 :       8048e98:       add    $0x1,%edx
    9.89 :       8048e9b:       mov    %edx,(%eax)
   14.15 :       8048e9d:       addl   $0x1,-0x8(%ebp)
    2.66 :       8048ea1:       mov    -0x8(%ebp),%eax
    1.39 :       8048ea4:       cmp    $0xfffff,%eax

Or here with raw opcodes (objdump -d). Some instructions use rather complicated addressing, so possibly they can't be handled by the 3 simple decoders and have to wait for the single complex decoder (there is a diagram at http://www.realworldtech.com/sandy-bridge/4/):

 8048e87:       8b 45 f8                mov    -0x8(%ebp),%eax
 8048e8a:       8d 14 85 00 00 00 00    lea    0x0(,%eax,4),%edx
 8048e91:       8b 45 08                mov    0x8(%ebp),%eax
 8048e94:       01 d0                   add    %edx,%eax
 8048e96:       8b 10                   mov    (%eax),%edx
 8048e98:       83 c2 01                add    $0x1,%edx
 8048e9b:       89 10                   mov    %edx,(%eax)
 8048e9d:       83 45 f8 01             addl   $0x1,-0x8(%ebp)
 8048ea1:       8b 45 f8                mov    -0x8(%ebp),%eax
 8048ea4:       3d ff ff 0f 00          cmp    $0xfffff,%eax

Backend stalls are stalls created by waiting for memory or cache (the thing you are interested in when measuring caches) and by internal execution core stalls:

$ perf record -e stalled-cycles-backend ./a20
Size=1048576 ITERATIONS=1024, TIME=2480.09 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.095 MB perf.data (~4149 samples) ]
$ perf annotate -d a20|cat
    4.25 :       8048e96:       mov    (%eax),%edx
   58.68 :       8048e98:       add    $0x1,%edx
    8.86 :       8048e9b:       mov    %edx,(%eax)
    3.94 :       8048e9d:       addl   $0x1,-0x8(%ebp)
    7.66 :       8048ea1:       mov    -0x8(%ebp),%eax
    7.40 :       8048ea4:       cmp    $0xfffff,%eax

Most backend stalls are reported for add $0x1,%edx because it is the consumer of the data loaded from the array by the previous instruction. Together with the store to the array, it accounts for about 70% of backend stalls. Multiply that by the backend share of all cycles in the program (about 7%) and you get roughly 5% of all cycles. In other words, the cache is faster than your program. Now we can answer your first question:

Why the time taken does not increase at all regardless of the size of my array?

Your test is so unoptimized that, although you are trying to measure the caches, they account for only about 5% of total run time. The test is also noisy enough that a 5% effect will not be visible.

With custom perf stat runs we can also measure the cache request-to-miss ratio. For the 4 KB program, the L1 data cache serves 99.99% of all loads and 99.999% of all stores. Note that your test generates several times more cache requests than are needed to walk the array and increment every element (1 billion loads + 1 billion stores). The additional accesses are for local variables like x; they are always served by the cache because their addresses are constant.

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a10
Size=1024 ITERATIONS=1048576, TIME=2412.25 ms

 Performance counter stats for './a10':

     5 375 195 765 L1-dcache-loads
           364 140 L1-dcache-load-misses     #    0,01% of all L1-dcache hits
     2 151 408 053 L1-dcache-stores
            13 350 L1-dcache-store-misses

For the 4 MB program the hit rate is many times worse: over 100 times more misses! Now about 1.26% of all loads are served not by L1 but by L2.

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a20
Size=1048576 ITERATIONS=1024, TIME=2443.92 ms

 Performance counter stats for './a20':

     5 378 035 007 L1-dcache-loads
        67 725 008 L1-dcache-load-misses     #    1,26% of all L1-dcache hits
     2 152 183 588 L1-dcache-stores
        67 266 426 L1-dcache-store-misses

So can we expect to notice the cache latency going from 4 CPU ticks to 12 (3 times longer), when the change affects only about 1.2% of cache requests and only about 7% of the program's cycles are sensitive to cache latency at all?

What if we use a bigger data set? OK, here is a25 (2^25 4-byte ints = 128 MB, several times more than the cache size):

$ perf stat   ./a25
Size=134217728 ITERATIONS=8, TIME=2437.25 ms

 Performance counter stats for './a25':

           262 417 page-faults               #    0,090 M/sec
    10 214 588 827 cycles                    #    3,499 GHz
     6 272 114 853 stalled-cycles-frontend   #   61,40% frontend cycles idle
     1 098 632 880 stalled-cycles-backend    #   10,76% backend  cycles idle
    13 683 671 982 instructions              #    1,34  insns per cycle
                                             #    0,46  stalled cycles per insn
     1 274 410 549 branches                  #  436,519 M/sec
           315 656 branch-misses             #    0,02% of all branches

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a25
Size=134217728 ITERATIONS=8, TIME=2444.13 ms

 Performance counter stats for './a25':

     6 138 410 226 L1-dcache-loads
        77 025 747 L1-dcache-load-misses     #    1,25% of all L1-dcache hits
     2 515 141 824 L1-dcache-stores
        76 320 695 L1-dcache-store-misses

Almost the same L1 miss rate, and more backend stalls. I was able to get stats for the "cache-references,cache-misses" events, and I suspect they refer to the L3 cache (there are several times more requests to L2):

$ perf stat -e 'cache-references,cache-misses' ./a25
Size=134217728 ITERATIONS=8, TIME=2440.71 ms

 Performance counter stats for './a25':

        17 053 482 cache-references
        11 829 118 cache-misses              #   69,365 % of all cache refs

So the miss rate is high, but the test does 1 billion (useful) loads and only 0.08 billion of them miss L1. About 0.01 billion requests are served by memory. Memory latency is around 230 CPU clocks, against 4 clocks for L1. Is the test able to see this? Maybe, if the noise is low.
