I'm seeing unexpectedly poor performance for a simple store loop which has two stores: one with a forward stride of 16 bytes and one that's always to the same location.
Here's what I've found so far. Unfortunately it doesn't really offer an explanation for the poor performance, and none at all for the bimodal distribution; it's more a set of rules for when you might see this behavior, plus notes on mitigating it:
The original question arbitrarily used a stride of 16, but let's start with probably the simplest case: a stride of 64, i.e., one full cache line. As it turns out the various effects are visible with any stride, but 64 ensures an L1 cache miss on every iteration and so removes some variables.
Let's also remove the second store for now - so we're just testing a single 64-byte strided store over 64K of memory:
top:
mov BYTE PTR [rdx],al
add rdx,0x40
sub rdi,0x1
jne top
Running this in the same harness as above, I get about 3.05 cycles/store [2], although there is quite a bit of variance compared to what I'm used to seeing (you can even find a 3.0 in there).
So we know already we probably aren't going to do better than this for sustained stores purely to L2 [1]. While Skylake apparently has 64 bytes of throughput between L1 and L2, in the case of a stream of stores that bandwidth has to be shared between evictions from L1 and filling new lines into L1. 3 cycles seems reasonable if it takes, say, 1 cycle each to (a) evict the dirty victim line from L1 to L2, (b) update L1 with the new line from L2 and (c) commit the store into L1.
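As a sanity check, here is that accounting as a toy model (entirely my own construction; the three 1-cycle costs and the one-new-line-per-store claim are just the assumptions stated above, not measured hardware behavior):

```python
LINE = 64            # cache line size in bytes
REGION = 64 * 1024   # size of the stored-over buffer, from the text
STRIDE = 64

# With a 64-byte stride every store opens a fresh line: the region
# spans exactly 1024 lines and we make one store per line.
lines_touched = {(i * STRIDE) // LINE for i in range(REGION // STRIDE)}
assert len(lines_touched) == REGION // LINE  # 1024 distinct lines

# Assumed per-line costs, one cycle each, as in (a)-(c) above:
evict = 1   # write the dirty victim back to L2
fill = 1    # accept the target line from L2
commit = 1  # commit the store into L1
print(evict + fill + commit)  # 3 cycles/store, vs. the measured ~3.05
```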
What happens when you add a second write to the same cache line (to the next byte, although it turns out not to matter) in the loop? Like this:
top:
mov BYTE PTR [rdx],al
mov BYTE PTR [rdx+0x1],al
add rdx,0x40
sub rdi,0x1
jne top
Here's a histogram of the timing for 1000 runs of the test harness for the above loop:
count cycles/itr
1 3.0
51 3.1
5 3.2
5 3.3
12 3.4
733 3.5
139 3.6
22 3.7
2 3.8
11 4.0
16 4.1
1 4.3
2 4.4
So the majority of times are clustered around 3.5 cycles. That means that the additional store added only 0.5 cycles to the timing. It could be that the store buffer is able to drain two stores to the L1 in a single cycle if they are in the same line, but this only happens about half the time.
Consider that the store buffer contains a series of stores like 1, 1, 2, 2, 3, 3, where the number indicates the cache line: half of the positions have two consecutive stores to the same cache line and half don't. As the store buffer is waiting to drain stores, and the L1 is busy evicting lines to and accepting lines from L2, the L1 will become available for a store at an "arbitrary" point; if it lands on a 1, 1 position, maybe both stores drain in one cycle, but if it lands on 1, 2 it takes two cycles.
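That hand-waving can be turned into a toy simulation (again entirely my own construction: the fixed 2 cycles of evict/fill per line, the fixed two-store drain windows, and the rule that a mixed window takes two cycles are all assumptions, chosen to match the accounting above):

```python
def cycles_per_iter(lines, phase):
    """Toy model of store-buffer draining.

    `lines` holds the cache-line index of each store in program order,
    e.g. [0, 0, 1, 1, 2, 2] for two same-line stores per iteration.
    The L1 becomes available at an "arbitrary" point, modeled by
    `phase`: that many leading stores drain singly, one cycle each.
    After that, drains happen in fixed windows of two stores: a
    same-line window commits in 1 cycle, a mixed window needs 2 (so a
    misalignment, once established, persists). Each distinct line also
    costs 2 fixed cycles (evict the victim, fill from L2).
    """
    slots = phase  # leading stores drained singly
    i = phase
    while i < len(lines):
        window = lines[i:i + 2]
        slots += 1 if len(set(window)) == 1 else 2
        i += 2
    n_lines = len(set(lines))
    return (2 * n_lines + slots) / n_lines

stream = [i for i in range(1000) for _ in range(2)]  # 0,0,1,1,2,2,...
print(cycles_per_iter(stream, 0))  # lucky alignment:   3.0 cycles/iter
print(cycles_per_iter(stream, 1))  # unlucky alignment: 4.0 cycles/iter
# A 50/50 mix of the two phases averages 3.5, the main histogram peak.
```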
Note there is another peak of about 5% of results around 3.1 rather than 3.5. That could be a steady state where we always get the lucky outcome. There is another peak of around 3% at ~4.0-4.1 - the "always unlucky" arrangement.
Let's test this theory by looking at various offsets between the first and second stores:
top:
mov BYTE PTR [rdx + FIRST],al
mov BYTE PTR [rdx + SECOND],al
add rdx,0x40
sub rdi,0x1
jne top
We try all values of FIRST and SECOND from 0 to 256 in steps of 8. The results, with varying FIRST values on the vertical axis and SECOND on the horizontal:
We see a specific pattern - the white values are "fast" (around the 3.0-4.1 values discussed above for the offset of 1). Yellow values are higher, up to 8 cycles, and red up to 10. The purple outliers are the highest and are usually cases where the "slow mode" described in the OP kicks in (usually clocking in at 18.0 cycles/iter). We notice the following:
From the pattern of white cells, we see that we get the fast ~3.5 cycle result as long as the second store is in the same cache line as, or the next one after, the first store. This is consistent with the idea above that stores to the same cache line are handled more efficiently. The reason that having the second store in the next cache line works is that the pattern ends up being the same, except for the very first access: 0, 0, 1, 1, 2, 2, ... vs 0, 1, 1, 2, 2, ... - where in the second case it is the second store that first touches each cache line. The store buffer doesn't care though. As soon as you get into different cache lines, you get a pattern like 0, 2, 1, 3, 2, ... and apparently this sucks?
The purple "outliers" never appear in the white areas, so they are apparently restricted to the scenario that is already slow (and the slow mode here makes it about 2.5x slower: from ~8 to 18 cycles).
We can zoom out a bit and look at even larger offsets:
The same basic pattern holds, although we see that the performance improves (green area) as the second store gets further away from (ahead of or behind) the first one, until it gets worse again at an offset of about ~1700 bytes. Even in the improved area we only get to at best 5.8 cycles/iteration, still much worse than the same-line performance of 3.5.
If you add any kind of load or prefetch instruction that runs ahead [3] of the stores, both the overall slow performance and the "slow mode" outliers disappear:
You can port this back to the original stride-by-16 problem - any type of prefetch or load in the core loop, pretty much insensitive to the distance (even if it's behind, in fact), fixes the issue and you get 2.3 cycles/iteration, close to the best possible ideal of 2.0, and equal to the sum of the two stores measured in separate loops.
So the basic rule is that stores to L2 without corresponding loads are much slower than if you software prefetch them - unless the entire store stream accesses cache lines in a single sequential pattern. That's contrary to the idea that a linear pattern like this never benefits from SW prefetch.
I don't really have a fleshed out explanation, but it could include these factors:
These comments by Dr. McCalpin on the Intel forums are also quite interesting.
[0] Mostly only achievable with the L2 streamer disabled, since otherwise the additional contention on the L2 slows this down to about 1 line per 3.5 cycles.
[1] Contrast this with loads, where I get almost exactly 1.5 cycles per load, for an implied bandwidth of ~43 bytes per cycle. This makes perfect sense: the L1<->L2 bandwidth is 64 bytes, but assuming that the L1 is either accepting a line from the L2 or servicing load requests from the core every cycle (but not both in parallel), you have 3 cycles for two loads to different L2 lines: 2 cycles to accept the lines from L2, and 1 cycle to satisfy two load instructions.
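Spelled out, the arithmetic in that footnote (just restating its numbers, under the same assumption that fills and load servicing can't overlap):

```python
# Two loads per iteration, each hitting a different L2-resident line:
accept_fills = 2   # cycles for the L1 to accept two lines from L2
service_loads = 1  # one cycle satisfies both load instructions
cycles_per_load = (accept_fills + service_loads) / 2
print(cycles_per_load)       # 1.5 cycles per load
print(64 / cycles_per_load)  # ~42.7 bytes/cycle implied L1<->L2 bandwidth
```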
[2] With prefetching off. As it turns out, the L2 prefetcher competes for access to the L2 cache when it detects streaming access: even though it always finds the candidate lines and doesn't go to L3, this slows down the code and increases variability. The conclusions generally hold with prefetching on, but everything is just a bit slower (here's a big blob of results with prefetching on - you see about 3.3 cycles per store, but with lots of variability).
[3] It doesn't even really need to be ahead - prefetching several lines behind also works: I guess the prefetches/loads just quickly run ahead of the stores, which are bottlenecked, so they get ahead anyway. In this way, the prefetching is kind of self-healing and seems to work with almost any value you put in.
Sandy Bridge has "L1 data hardware pre-fetchers". What this means is that initially when you do your store the CPU has to fetch data from L2 into L1; but after this has happened several times the hardware pre-fetcher notices the nice sequential pattern and starts pre-fetching data from L2 into L1 for you, so that the data is either in L1 or "half way to L1" before your code does its store.
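That mechanism can be sketched as a toy next-line stream detector (entirely my own construction, with an invented trigger threshold; it is NOT the actual logic of Intel's L1 DCU prefetcher, whose internals aren't publicly documented in this detail):

```python
def prefetched(lines, trigger=3):
    """Return the set of line numbers fetched ahead of demand.

    After `trigger` consecutive +1-line accesses, the detector starts
    fetching the following line ahead of the demand stream, so that
    line is in L1 (or "half way to L1") before the store needs it.
    """
    ahead = set()
    run, last = 0, None
    for line in lines:
        run = run + 1 if last is not None and line == last + 1 else 0
        if run >= trigger:
            ahead.add(line + 1)  # next line fetched ahead of demand
        last = line
    return ahead

stream = list(range(16))  # a 64-byte-strided store touches lines 0..15
hits = prefetched(stream)
print(sorted(hits)[:3])   # [4, 5, 6]: from line 3 on, line N+1 is covered
```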