Poor memcpy Performance on Linux

后端未结

关注

 7  1895

执笔经年 2021-01-29 21:04

We have recently purchased some new servers and are experiencing poor memcpy performance. The memcpy performance is 3x slower on the servers compared to our laptops.

7条回答

感情败类 (楼主)

2021-01-29 21:42
[I would make this a comment, but do not have enough reputation to do so.]

I have a similar system and see similar results, but can add a few data points:
- If you reverse the direction of your naive memcpy (i.e. convert to *p_dest-- = *p_src--), then you may get much worse performance than for the forward direction (~637 ms for me). There was a change in memcpy() in glibc 2.12 that exposed several bugs for calling memcpy on overlapping buffers (http://lwn.net/Articles/414467/) and I believe the issue was caused by switching to a version of memcpy that operates backwards. So, backward versus forward copies may explain the memcpy()/memmove() disparity.
- It seems to be better to not use non-temporal stores. Many optimized memcpy() implementations switch to non-temporal stores (which are not cached) for large buffers (i.e. larger than the last level cache). I tested Agner Fog's version of memcpy (http://www.agner.org/optimize/#asmlib) and found that it was approximately the same speed as the version in glibc. However, asmlib has a function (SetMemcpyCacheLimit) that allows setting the threshold above which non-temporal stores are used. Setting that limit to 8GiB (or just larger than the 1 GiB buffer) to avoid the non-temporal stores doubled performance in my case (time down to 176ms). Of course, that only matched the forward-direction naive performance, so it is not stellar.
- The BIOS on those systems allows four different hardware prefetchers to be enabled/disabled (MLC Streamer Prefetcher, MLC Spatial Prefetcher, DCU Streamer Prefetcher, and DCU IP Prefetcher). I tried disabling each, but doing so at best maintained performance parity and reduced performance for a few of the settings.
- Disabling the running average power limit (RAPL) DRAM mode has no impact.
- I have access to other Supermicro systems running Fedora 19 (glibc 2.17). With a Supermicro X9DRG-HF board, Fedora 19, and Xeon E5-2670 CPUs, I see similar performance as above. On a Supermicro X10SLM-F single socket board running a Xeon E3-1275 v3 (Haswell) and Fedora 19, I see 9.6 GB/s for memcpy (104ms). The RAM on the Haswell system is DDR3-1600 (same as the other systems).
UPDATES
- I set the CPU power management to Max Performance and disabled hyperthreading in the BIOS. Based on /proc/cpuinfo, the cores were then clocked at 3 GHz. However, this oddly decreased memory performance by around 10%.
- memtest86+ 4.10 reports bandwidth to main memory of 9091 MB/s. I could not find if this corresponds to read, write, or copy.
- The STREAM benchmark reports 13422 MB/s for copy, but they count bytes as both read and written, so that corresponds to ~6.5 GB/s if we want to compare to the above results.
0 讨论(0)

查看其它7个回答
发布评论:

提交评论
- 加载中...