We have recently purchased some new servers and are experiencing poor memcpy performance. The memcpy performance is 3x slower on the servers compared to our laptops.
It's possible that some CPU improvements in your IvyBridge-based laptop contribute to this gain over the SandyBridge-based servers.
Page-crossing Prefetch - your laptop CPU would prefetch ahead the next linear page whenever you reach the end of the current one, saving you a nasty TLB miss every time. To try and mitigate that, try building your server code for 2M / 1G pages.
Cache replacement schemes also seem to have been improved (see an interesting reverse engineering here). If indeed this CPU uses a dynamic insertion policy, it would easily prevent your copied data from trying to thrash your Last-Level-Cache (which it can't use effectively anyway due to the size), and save the room for other useful caching like code, stack, page table data, etc..). To test this, you could try rebuilding your naive implementation using streaming loads/stores (movntdq
or similar ones, you can also use gcc builtin for that). This possibility may explain the sudden drop in large data-set sizes.
I believe some improvements were also made with string-copy as well (here), it may or may not apply here, depending on how your assembly code looks like. You could try benchmarking with Dhrystone to test if there's an inherent difference. This may also explain the difference between memcpy and memmove.
If you could get hold of an IvyBridge based server or a Sandy-Bridge laptop it would be simplest to test all these together.