We have recently purchased some new servers and are experiencing poor memcpy performance. The memcpy performance is 3x slower on the servers compared to our laptops.
The numbers make sense to me. There are actually two questions here, and I'll answer them both.
First though, we need to have a mental model of how large1 memory transfers work on something like a modern Intel processor. This description is approximate and the details may change somewhat from architecture to architecture, but the high level ideas are quite constant.
1. When a load misses in the L1 data cache, a line buffer is allocated which will track the miss request until it is filled. This may be for a short time (a dozen cycles or so) if it hits in the L2 cache, or much longer (100+ nanoseconds) if it misses all the way to DRAM.

2. The memory subsystem itself has a maximum bandwidth limit, which you'll find conveniently listed on ARK. For example, the 3720QM in the Lenovo laptop shows a limit of 25.6 GB/s. This limit is basically the product of the effective frequency (1600 MHz) times 8 bytes (64 bits) per transfer times the number of channels (2): 1600 * 8 * 2 = 25.6 GB/s. The server chip, on the other hand, has a peak bandwidth of 51.2 GB/s per socket, for a total system bandwidth of ~102 GB/s.
Unlike other processor features, there are often only a few possible theoretical bandwidth numbers across the whole variety of chips, since the bandwidth depends only on the noted values, which are often the same across many different chips and even across architectures. It is unrealistic to expect DRAM to deliver exactly the theoretical rate (due to various low level concerns, discussed a bit here), but you can often get around 90% or more.
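If you like, that peak-bandwidth arithmetic fits in a few lines of C. The 1600 MT/s figure and the two laptop channels are the values quoted above; the four channels for the server socket are my assumption, but they are consistent with the 51.2 GB/s per-socket figure:

```c
#include <stdio.h>

/* Peak theoretical DRAM bandwidth = effective transfer rate (MT/s)
 * times 8 bytes per transfer times the number of channels. */
static double peak_gb_per_s(double mt_per_s, int channels)
{
    return mt_per_s * 8.0 * channels / 1000.0;   /* MB/s -> GB/s */
}

int main(void)
{
    /* Laptop: dual-channel DDR3-1600 (the 3720QM above). */
    printf("laptop peak: %.1f GB/s\n", peak_gb_per_s(1600.0, 2));            /* 25.6 */
    /* Server: assumed quad-channel DDR3-1600, matching the 51.2 GB/s figure. */
    printf("server peak (per socket): %.1f GB/s\n", peak_gb_per_s(1600.0, 4)); /* 51.2 */
    return 0;
}
```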
So the primary consequence of (1) is that you can treat misses to RAM as a kind of request-response system. A miss to DRAM allocates a fill buffer, and the buffer is released when the request comes back. There are only 10 of these buffers2 per CPU for demand3 misses, which puts a strict limit on the demand memory bandwidth a single CPU can generate, as a function of its latency.
For example, let's say your E5-2680 has a latency to DRAM of 80 ns. Every request brings in a 64-byte cache line, so if you just issued requests serially to DRAM you'd expect a throughput of a paltry 64 bytes / 80 ns = 0.8 GB/s, and you'd cut that in half again (at least) to get a memcpy figure, since it needs to read and write. Luckily, you can use your 10 line fill buffers to overlap 10 concurrent requests to memory and increase the bandwidth by a factor of 10, leading to a theoretical bandwidth of 8 GB/s.
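Here's that back-of-the-envelope model in a few lines of C, using the 80 ns latency and 10 fill buffers from above. It deliberately ignores prefetching, so treat it as an illustration of the concurrency limit rather than a prediction of real memcpy numbers:

```c
#include <stdio.h>

/* Rough model of single-core demand bandwidth: each outstanding miss
 * occupies one line fill buffer for the full memory latency and brings
 * back one 64-byte cache line, so with `buffers` misses in flight the
 * core moves at most buffers * 64 bytes per `latency_ns` nanoseconds.
 * Prefetching is ignored, so this is a floor, not a forecast. */
static double demand_bw_gb_per_s(double latency_ns, int buffers)
{
    return buffers * 64.0 / latency_ns;   /* bytes per ns == GB/s */
}

int main(void)
{
    printf("serial requests @ 80 ns: %.1f GB/s\n", demand_bw_gb_per_s(80.0, 1));   /* 0.8 */
    printf("10 fill buffers @ 80 ns: %.1f GB/s\n", demand_bw_gb_per_s(80.0, 10));  /* 8.0 */
    return 0;
}
```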
If you want to dig into even more details, this thread is pretty much pure gold. You'll find that facts and figures from John McCalpin, aka "Dr. Bandwidth", will be a common theme below.
So let's get into the details and answer the two questions...
You showed that the laptop systems do the memcpy benchmark in about 120 ms, while the server parts take around 300 ms. You also showed that this slowness is mostly not fundamental, since you were able to use memmove and your hand-rolled memcpy (hereafter, hrm) to achieve a time of about 160 ms, much closer to (but still slower than) the laptop performance.
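For reference, a minimal large-copy benchmark looks something like the sketch below. This is not the OP's code; the 1 GiB buffer size is just an arbitrary value comfortably larger than any LLC so the copy streams to and from DRAM:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (1UL << 30)   /* 1 GiB: comfortably larger than any LLC */

int main(void)
{
    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    if (!src || !dst) return 1;
    memset(src, 1, BUF_SIZE);   /* touch the pages so we time the copy, not page faults */
    memset(dst, 2, BUF_SIZE);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("memcpy of %lu MiB: %.1f ms (%.2f GB/s copied)\n",
           BUF_SIZE >> 20, ms, BUF_SIZE / ms / 1e6);
    free(src);
    free(dst);
    return 0;
}
```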
We already showed above that for a single core, the bandwidth is limited by the total available concurrency and latency rather than by the DRAM bandwidth. We expect that the server parts may have a longer latency, but not 300 / 120 = 2.5x longer!
The answer lies in streaming (aka non-temporal) stores. The libc version of memcpy you are using uses them, but memmove does not. You confirmed as much with your "naive" memcpy, which also doesn't use them, as well as with my configuring asmlib both to use the streaming stores (slow) and not (fast).
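In case it helps to see what "streaming stores" means concretely, here is a simplified copy loop using them. This is a sketch of the technique only, not glibc's actual implementation; it assumes AVX, a 32-byte-aligned destination, and a size that is a multiple of 32:

```c
#include <immintrin.h>
#include <stddef.h>

/* Simplified streaming-store copy: a sketch of the technique, not
 * glibc's memcpy.  Assumes a 32-byte-aligned dst and a size that is a
 * multiple of 32 bytes; compile with -mavx. */
void copy_nt(void *dst, const void *src, size_t size)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t i = 0; i < size; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(s + i)); /* ordinary load */
        _mm256_stream_si256((__m256i *)(d + i), v);  /* non-temporal store: bypasses the
                                                        cache and skips the read-for-ownership */
    }
    _mm_sfence();   /* make the streaming stores globally visible before returning */
}
```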
The streaming stores hurt the single CPU numbers because (a) they defeat the hardware prefetchers on the store side, so the line fill buffers stay occupied for longer, and (b) streaming stores apparently have a much longer effective latency on the E5's server uncore.
Both issues are better explained by quotes from John McCalpin in the above linked thread. On the topic of prefetch effectiveness and streaming stores he says:
With "ordinary" stores, L2 hardware prefetcher can fetch lines in advance and reduce the time that the Line Fill Buffers are occupied, thus increasing sustained bandwidth. On the other hand, with streaming (cache-bypassing) stores, the Line Fill Buffer entries for the stores are occupied for the full time required to pass the data to the DRAM controller. In this case, the loads can be accelerated by hardware prefetching, but the stores cannot, so you get some speedup, but not as much as you would get if both loads and stores were accelerated.
... and then for the apparently much longer latency for streaming stores on the E5, he says:
The simpler "uncore" of the Xeon E3 could lead to significantly lower Line Fill Buffer occupancy for streaming stores. The Xeon E5 has a much more complex ring structure to navigate in order to hand off the streaming stores from the core buffers to the memory controllers, so the occupancy might differ by a larger factor than the memory (read) latency.
In particular, Dr. McCalpin measured a ~1.8x slowdown for the E5 compared to a chip with the "client" uncore, but the 2.5x slowdown the OP reports is consistent with that, since the 1.8x figure is reported for STREAM TRIAD, which has a 2:1 ratio of loads:stores, while memcpy is at 1:1, and the stores are the problematic part.
This doesn't make streaming a bad thing - in effect, you are trading off latency for smaller total bandwidth consumption. You get less bandwidth because you are concurrency limited when using a single core, but you avoid all the read-for-ownership traffic, so you would likely see a (small) benefit if you ran the test simultaneously on all cores.
So far from being an artifact of your software or hardware configuration, the exact same slowdowns have been reported by other users, with the same CPU.
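If you want to try the all-cores experiment mentioned above, something along these lines will do it. It's a rough sketch (no core pinning, and the timed region includes buffer setup); swap the plain memcpy for a streaming-store copy to compare the two cases:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NTHREADS 8                 /* set this to your core count */
#define NCOPIES  10
#define BUF_SIZE (256UL << 20)     /* 256 MiB per thread: well past the LLC */

static void *copy_worker(void *arg)
{
    (void)arg;
    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    if (!src || !dst) return NULL;
    memset(src, 1, BUF_SIZE);
    memset(dst, 2, BUF_SIZE);
    for (int i = 0; i < NCOPIES; i++)
        memcpy(dst, src, BUF_SIZE);    /* swap in a streaming-store copy to compare */
    free(src);
    free(dst);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, copy_worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gb_copied = (double)NTHREADS * NCOPIES * BUF_SIZE / 1e9;
    printf("%d threads, %.1f GB copied in %.2f s: ~%.1f GB/s of traffic (read + write)\n",
           NTHREADS, gb_copied, sec, 2.0 * gb_copied / sec);
    return 0;
}
```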
Even after correcting the non-temporal store issue, you are still seeing roughly a 160 / 120 = ~1.33x slowdown on the server parts. What gives?
Well, it's a common fallacy that server CPUs are in all respects faster than, or at least equal to, their client counterparts. It's just not true - what you are paying for (often at $2,000 a chip or so) on the server parts is mostly (a) more cores, (b) more memory channels, (c) support for more total RAM, and (d) support for "enterprise-ish" features like ECC, virtualization features, etc.5
In fact, latency-wise, server parts are usually only equal to or slower than their client4 counterparts. When it comes to memory latency this is especially true, because the server uncore is much more complex (requests have a longer ring structure to navigate between the core and the memory controllers, as the McCalpin quote above notes) and because the memory system is built for capacity rather than raw speed (more channels and more total RAM supported). So it is typical for server parts to have a latency 40% to 60% longer than client parts. For the E5 you'll probably find that ~80 ns is a typical latency to RAM, while client parts are closer to 50 ns.
So anything that is RAM latency constrained will run slower on server parts, and as it turns out, memcpy on a single core is latency constrained. That's confusing, because memcpy seems like a bandwidth measurement, right? Well, as described above, a single core doesn't have enough resources to keep enough requests to RAM in flight at a time to get close to the RAM bandwidth6, so performance depends directly on latency.

The client chips, on the other hand, have both lower latency and lower bandwidth, so one core comes much closer to saturating the bandwidth (this is often why streaming stores are a big win on client parts - when even a single core can approach the RAM bandwidth, the 50% store bandwidth reduction that streaming stores offer helps a lot).
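To put rough numbers on that, using the same 10-buffer model from above and the ~50 ns vs ~80 ns latencies just mentioned (so this is an illustration, not a measurement): the client core can sustain about 10 * 64 bytes / 50 ns ≈ 12.8 GB/s of demand traffic against its 25.6 GB/s peak, while the server core tops out around 10 * 64 bytes / 80 ns = 8 GB/s against a 51.2 GB/s peak - so the latency difference flows almost directly into the copy time.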
There are lots of good sources to read more on this stuff; here are a couple: the MemLatX86 and NewMemLat links.

1 By large I just mean somewhat larger than the LLC. For copies that fit in the LLC (or any higher cache level) the behavior is very different. The OP's llcachebench graph shows that in fact the performance deviation only starts when the buffers start to exceed the LLC size.
2 In particular, the number of line fill buffers has apparently been constant at 10 for several generations, including the architectures mentioned in this question.
3 When we say demand here, we mean that it is associated with an explicit load/store in the code, rather than, say, being brought in by a prefetch.
4 When I refer to a server part here, I mean a CPU with a server uncore. This largely means the E5 series, as the E3 series generally uses the client uncore.
5 In the future, it looks like you can add "instruction set extensions" to this list, as it seems that AVX-512 will appear only on the Skylake server parts.
6 Per Little's law, at a latency of 80 ns we'd need (51.2 B/ns * 80 ns) == 4096 bytes, or 64 cache lines, in flight at all times to reach the maximum bandwidth, but one core provides less than 20.