We have recently purchased some new servers and are experiencing poor memcpy performance. The memcpy performance is 3x slower on the servers compared to our laptops.
[I would make this a comment, but do not have enough reputation to do so.]
I have a similar system and see similar results, but can add a few data points:
If you reverse the direction of the naive memcpy (i.e. convert to *p_dest-- = *p_src--), then you may get much worse performance than for the forward direction (~637 ms for me). A change to memcpy() in glibc 2.12 exposed several bugs in code that called memcpy on overlapping buffers (http://lwn.net/Articles/414467/), and I believe the issue was caused by switching to a version of memcpy that operates backwards. So, backward versus forward copies may explain the memcpy()/memmove() disparity.
Some memcpy() implementations switch to non-temporal stores (which bypass the cache) for large buffers (i.e. larger than the last level cache). I tested Agner Fog's version of memcpy (http://www.agner.org/optimize/#asmlib) and found that it was approximately the same speed as the version in glibc. However, asmlib has a function (SetMemcpyCacheLimit) that allows setting the threshold above which non-temporal stores are used. Setting that limit to 8 GiB (or just larger than the 1 GiB buffer) to avoid the non-temporal stores doubled performance in my case (time down to 176 ms). Of course, that only matched the forward-direction naive performance, so it is not stellar.
... memcpy (104 ms). The RAM on the Haswell system is DDR3-1600 (same as the other systems).
UPDATES
... /proc/cpuinfo, the cores were then clocked at 3 GHz. However, this oddly decreased memory performance by around 10%.
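For anyone who wants to try the forward versus backward comparison from the first data point, here is a minimal sketch of the two naive copies plus a small timing helper. The names copy_forward, copy_backward, copy_fn and time_copy_ms are hypothetical and chosen for illustration; this is not the benchmark code from the question. Note that an optimizing compiler may replace either loop with a call to memcpy or memmove, so it is worth checking the generated assembly before trusting the numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Naive forward copy: walks both buffers from low to high addresses. */
static void copy_forward(char *dest, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        *dest++ = *src++;
}

/* Naive backward copy: walks both buffers from high to low addresses.
 * Pre-decrement from one past the end is used instead of
 * *p_dest-- = *p_src-- so that no pointer before the start of the
 * buffer is ever formed; the access pattern is the same. */
static void copy_backward(char *dest, const char *src, size_t n)
{
    char *p_dest = dest + n;
    const char *p_src = src + n;
    for (size_t i = 0; i < n; i++)
        *--p_dest = *--p_src;
}

/* Hypothetical function-pointer type so one helper can time both copies. */
typedef void (*copy_fn)(char *, const char *, size_t);

/* Times one copy of n bytes (n > 0) and returns the elapsed CPU time in
 * milliseconds, or -1.0 if the buffers could not be allocated. */
static double time_copy_ms(copy_fn fn, size_t n)
{
    char *src = malloc(n);
    char *dest = malloc(n);
    if (src == NULL || dest == NULL) {
        free(src);
        free(dest);
        return -1.0;
    }

    /* Touch every page first so page faults are not part of the timing. */
    memset(src, 0xA5, n);
    memset(dest, 0, n);

    clock_t start = clock();
    fn(dest, src, n);
    clock_t end = clock();

    /* Sanity check: the copy must reproduce the source exactly. */
    assert(memcmp(dest, src, n) == 0);

    free(src);
    free(dest);
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy_ms(copy_forward, n) and time_copy_ms(copy_backward, n) from main with the same large n (for example the 1 GiB buffer size mentioned above) gives two timings to compare. A single run is noisy, so repeating each measurement a few times is advisable.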