I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy.
ERMSB was introduced with the Ivy Bridge microarchitecture.
You say that you want:
an answer that shows when ERMSB is useful
But I'm not sure it means what you think it means. Looking at the 3.7.6.1 docs you link to, it explicitly says:
implementing memcpy using ERMSB might not reach the same level of throughput as using 256-bit or 128-bit AVX alternatives, depending on length and alignment factors.
So just because CPUID indicates support for ERMSB, that isn't a guarantee that REP MOVSB will be the fastest way to copy memory. It just means it won't suck as badly as it did on some previous CPUs.
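For reference, here is a minimal sketch of what "CPUID indicates support" means in practice. It assumes GCC or Clang (for `<cpuid.h>` and `__get_cpuid_count`); the function name `cpu_has_erms` is my own, not from any library. The ERMS flag is reported in CPUID leaf 7, subleaf 0, EBX bit 9.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header; assumed to be available */
#endif

/* Hypothetical helper: returns true if the CPU reports the ERMS feature
   flag (CPUID.(EAX=07H, ECX=0):EBX bit 9). Returns false on non-x86
   targets or when leaf 7 is not supported. */
static bool cpu_has_erms(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid_count returns 0 if the requested leaf is unsupported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;

    return ((ebx >> 9) & 1u) != 0;
#else
    return false;
#endif
}
```

As the quoted documentation says, a set flag only tells you the fast-string path exists, not that it beats the alternatives for your sizes and alignments. You still have to measure.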
However, just because there may be alternatives that can, under certain conditions, run faster doesn't mean that REP MOVSB is useless. Now that the performance penalties that this instruction used to incur are gone, it is potentially a useful instruction again.
Remember, it is a tiny bit of code (2 bytes!) compared to some of the more involved memcpy routines I have seen. Since loading and running big chunks of code also has a penalty (throwing some of your other code out of the CPU's cache), sometimes the 'benefit' of AVX et al. is going to be offset by the impact it has on the rest of your code. It depends on what you are doing.
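To show just how little code is involved, here is a rough sketch of a REP MOVSB copy wrapped in GCC/Clang extended inline asm. The name `copy_rep_movsb` is made up for illustration, and this assumes the direction flag is clear on entry (which the usual x86 calling conventions guarantee). On non-x86 targets it simply falls back to memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper around REP MOVSB (illustrative, not a tuned memcpy).
   RDI/EDI = destination, RSI/ESI = source, RCX/ECX = byte count. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if defined(__x86_64__) || defined(__i386__)
    /* All three registers are modified by the instruction, so they are
       read-write operands; "memory" tells the compiler that memory is
       read and written behind its back. */
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
#else
    memcpy(dst, src, n);   /* fallback for non-x86 builds */
#endif
    return ret;
}
```

The instruction itself encodes as two bytes (F3 A4). Everything else here is just the compiler setting up registers.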
You also ask:
Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?
It isn't going to be possible to "do something" to make REP MOVSB run any faster. It does what it does.
If you want the higher speeds you are seeing from memcpy, you can dig up the source for it. It's out there somewhere. Or you can trace into it from a debugger and see the actual code paths being taken. My expectation is that it's using some of those AVX instructions to work with 128 or 256 bits at a time.
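To give a feel for what "working with 128 bits at a time" looks like, here is a deliberately simple sketch using SSE2 intrinsics. This is my own illustration, not the code from any real memcpy; an actual library routine will be far more elaborate (alignment handling, size-based dispatch, non-temporal stores, and so on). I use 128-bit SSE2 rather than 256-bit AVX only because SSE2 needs no extra compiler flags on x86-64. The name `copy_vec128` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical 16-bytes-per-iteration copy using unaligned loads/stores. */
static void *copy_vec128(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

#if defined(__SSE2__)
    while (n >= 16) {
        /* Unaligned load and store, so no alignment requirement on s or d. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        n -= 16;
    }
#endif

    /* Copy the remaining tail (or everything, if SSE2 is unavailable). */
    memcpy(d, s, n);
    return dst;
}
```

Benchmarking something like this against REP MOVSB across a range of sizes and alignments is the only way to know which one wins on your hardware.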
Or you can just... Well, you asked us not to say it.