I would like to use enhanced REP MOVSB (ERMSB) to get high bandwidth for a custom memcpy.
ERMSB was introduced with the Ivy Bridge microarchitecture (processors released in 2012 and 2013). It lets us copy memory fast, but we still need to check the corresponding CPUID bit before relying on it (see the sketch after the list below).
The cheapest versions of later processors - the Kaby Lake Celeron and Pentium, released in 2017 - don't have AVX, which could otherwise have been used for fast memory copying, but they still have Enhanced REP MOVSB. And some of Intel's mobile and low-power architectures released in 2018 and onwards, which were not based on SkyLake, copy about twice as many bytes per CPU cycle with REP MOVSB.
REP MOVSB (ERMSB) is only faster than AVX copy or general-purpose register copy if the block size is at least 256 bytes. For blocks below 64 bytes it is much slower, because of ERMSB's high internal startup cost - about 35 cycles.
See the Intel Optimization Manual, section 3.7.6 "Enhanced REP MOVSB and STOSB Operation (ERMSB)": http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
- startup cost is 35 cycles;
- both the source and destination addresses have to be aligned to a 16-Byte boundary;
- the source region should not overlap with the destination region;
- the length has to be a multiple of 64 to produce higher performance;
- the direction has to be forward (CLD).
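As a concrete illustration, here is a minimal sketch of an ERMSB-based copy, assuming x86-64 with GCC or Clang (GNU inline assembly and `<cpuid.h>`); the helper names `have_ermsb` and `rep_movsb_copy` are mine, not any standard API. It checks CPUID.(EAX=07H, ECX=0):EBX bit 9 (ERMS) before using REP MOVSB, and falls back to the library memcpy otherwise:

```c
#include <cpuid.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns nonzero if CPUID.(EAX=07H, ECX=0):EBX.ERMS [bit 9] is set. */
static int have_ermsb(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 9) & 1u;
}

/* Forward byte copy with REP MOVSB: RDI = destination, RSI = source, RCX = count.
   Assumes non-overlapping buffers and a clear direction flag (DF = 0 is
   guaranteed at function entry by the x86-64 ABIs). */
static void *rep_movsb_copy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}

int main(void)
{
    size_t n = 4096;                             /* at least 256 bytes, multiple of 64 */
    unsigned char *src = aligned_alloc(64, n);   /* 64-byte alignment also satisfies
                                                    the 16-byte requirement above */
    unsigned char *dst = aligned_alloc(64, n);
    memset(src, 0xAB, n);

    if (have_ermsb())
        rep_movsb_copy(dst, src, n);
    else
        memcpy(dst, src, n);                     /* fall back when ERMS is not reported */

    printf("ERMSB: %d, dst[0] = 0x%02X\n", have_ermsb(), dst[0]);
    free(src);
    free(dst);
    return 0;
}
```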
As I said earlier, REP MOVSB begins to outperform other methods when the length is at least 256 bytes, but to see a clear benefit over AVX copy, the length has to be more than 2048 bytes. Also, it should be noted that merely using AVX (256-bit registers) or AVX-512 (512-bit registers) for memory copy may sometimes have bad consequences, like AVX/SSE transition penalties or reduced turbo frequency, so REP MOVSB is a safer way to copy memory than AVX.
On the effect of alignment for REP MOVSB vs. AVX copy, the Intel Manual gives the following information:
- if the source buffer is not aligned, the impact on ERMSB implementation versus 128-bit AVX is similar;
- if the destination buffer is not aligned, the impact on ERMSB implementation can be 25% degradation, while 128-bit AVX implementation of memcpy may degrade only 5%, relative to 16-byte aligned scenario.
I have made tests on an Intel Core i5-6600, in 64-bit mode, and have compared REP MOVSB memcpy() with a simple MOV RAX, [SRC]; MOV [DST], RAX implementation (sketched after the results below) when the data fits the L1 cache:
REP MOVSB memcpy():
- 1622400000 data blocks of 32 bytes took 17.9337 seconds to copy; 2760.8205 MB/s
- 1622400000 data blocks of 64 bytes took 17.8364 seconds to copy; 5551.7463 MB/s
- 811200000 data blocks of 128 bytes took 10.8098 seconds to copy; 9160.5659 MB/s
- 405600000 data blocks of 256 bytes took 5.8616 seconds to copy; 16893.5527 MB/s
- 202800000 data blocks of 512 bytes took 3.9315 seconds to copy; 25187.2976 MB/s
- 101400000 data blocks of 1024 bytes took 2.1648 seconds to copy; 45743.4214 MB/s
- 50700000 data blocks of 2048 bytes took 1.5301 seconds to copy; 64717.0642 MB/s
- 25350000 data blocks of 4096 bytes took 1.3346 seconds to copy; 74198.4030 MB/s
- 12675000 data blocks of 8192 bytes took 1.1069 seconds to copy; 89456.2119 MB/s
- 6337500 data blocks of 16384 bytes took 1.1120 seconds to copy; 89053.2094 MB/s
MOV RAX... memcpy():
- 1622400000 data blocks of 32 bytes took 7.3536 seconds to copy; 6733.0256 MB/s
- 1622400000 data blocks of 64 bytes took 10.7727 seconds to copy; 9192.1090 MB/s
- 811200000 data blocks of 128 bytes took 8.9408 seconds to copy; 11075.4480 MB/s
- 405600000 data blocks of 256 bytes took 8.4956 seconds to copy; 11655.8805 MB/s
- 202800000 data blocks of 512 bytes took 9.1032 seconds to copy; 10877.8248 MB/s
- 101400000 data blocks of 1024 bytes took 8.2539 seconds to copy; 11997.1185 MB/s
- 50700000 data blocks of 2048 bytes took 7.7909 seconds to copy; 12710.1252 MB/s
- 25350000 data blocks of 4096 bytes took 7.5992 seconds to copy; 13030.7062 MB/s
- 12675000 data blocks of 8192 bytes took 7.4679 seconds to copy; 13259.9384 MB/s
So, even for 128-byte blocks, REP MOVSB is slower than a simple MOV RAX copy in a loop (not unrolled). The ERMSB implementation begins to outperform the MOV RAX loop only starting from 256-byte blocks.
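For reference, here is a minimal sketch of the kind of loop behind the "MOV RAX" numbers above (my reconstruction, not the exact benchmark code): an 8-byte-at-a-time copy that is not unrolled. Compiled with -O2 -fno-tree-vectorize, GCC emits essentially MOV RAX, [src]; MOV [dst], RAX per iteration:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar 64-bit copy loop, not unrolled; assumes n is a multiple of 8
   and the buffers do not overlap. */
static void qword_loop_copy(void *dst, const void *src, size_t n)
{
    uint64_t *d = dst;
    const uint64_t *s = src;
    for (size_t i = 0; i < n / 8; i++)
        d[i] = s[i];    /* one 8-byte load and one 8-byte store per iteration */
}
```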
#Normal (not enhanced) REP MOVS on Nehalem and later#
Surprisingly, earlier architectures (Nehalem and newer, up to but not including Ivy Bridge), which didn't yet have Enhanced REP MOVSB, already had a quite fast REP MOVSD/MOVSQ (but not REP MOVSB/MOVSW) implementation for large blocks - large, but not so large as to exceed the L1 cache.
The Intel Optimization Manual (2.5.6 "REP String Enhancement") gives the following information related to the Nehalem microarchitecture - Intel Core i5, i7 and Xeon processors released in 2009 and 2010.
The latency of MOVSB is 9 cycles if ECX < 4; otherwise, REP MOVSB with ECX > 9 has a 50-cycle startup cost.
My conclusion: REP MOVSB is almost useless on Nehalem.
Quote from the Intel Optimization Manual (2.5.6 REP String Enhancement):
- Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
- Fast string (ECX >= 76: excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of the REP string will vary if one of the 16-byte data transfers spans across a cache line boundary:
  - Split-free: the latency consists of a startup cost of about 40 cycles, and each 64 bytes of data adds 4 cycles.
  - Cache splits: the latency consists of a startup cost of about 35 cycles, and each 64 bytes of data adds 6 cycles.
- Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.
Intel does not seem to be correct here. From the above quote we understand that for very large memory blocks, REP MOVSW is as fast as REP MOVSD/MOVSQ, but tests have shown that only REP MOVSD/MOVSQ are fast, while REP MOVSW is even slower than REP MOVSB on Nehalem and Westmere.
According to the information provided by Intel in the manual, on previous Intel microarchitectures (before 2008) the startup costs are even higher.
Conclusion: if you just need to copy data that fits the L1 cache, 4 cycles per 64 bytes of data is excellent, and you don't need to use XMM registers!
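To put that in the same bytes-per-cycle terms as the tables below: 64 bytes per 4 cycles is an asymptotic 16 B/c, and, per the quoted split-free fast-string numbers, copying for example 1024 bytes would cost roughly 40 + (1024/64)*4 = 104 cycles, i.e. about 10 bytes per cycle once the startup cost is included (the measured figures below are higher than this model predicts, so treat it as a rough guide only).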
#REP MOVSD/MOVSQ is the universal solution that works excellently on all Intel processors (no ERMSB required) if the data fits the L1 cache#
Here are test results for REP MOVS* when the source and destination were in the L1 cache, using blocks large enough not to be seriously affected by startup costs, but not so large as to exceed the L1 cache size. Source: http://users.atw.hu/instlatx64/ (B/c = bytes per cycle).
Yonah (2006-2008)
REP MOVSB 10.91 B/c
REP MOVSW 10.85 B/c
REP MOVSD 11.05 B/c
Nehalem (2009-2010)
REP MOVSB 25.32 B/c
REP MOVSW 19.72 B/c
REP MOVSD 27.56 B/c
REP MOVSQ 27.54 B/c
Westmere (2010-2011)
REP MOVSB 21.14 B/c
REP MOVSW 19.11 B/c
REP MOVSD 24.27 B/c
Ivy Bridge (2012-2013) - with Enhanced REP MOVSB (all subsequent CPUs also have Enhanced REP MOVSB)
REP MOVSB 28.72 B/c
REP MOVSW 19.40 B/c
REP MOVSD 27.96 B/c
REP MOVSQ 27.89 B/c
SkyLake (2015-2016)
REP MOVSB 57.59 B/c
REP MOVSW 58.20 B/c
REP MOVSD 58.10 B/c
REP MOVSQ 57.59 B/c
Kaby Lake (2016-2017)
REP MOVSB 58.00 B/c
REP MOVSW 57.69 B/c
REP MOVSD 58.00 B/c
REP MOVSQ 57.89 B/c
Cannon Lake, mobile (May 2018 - February 2020)
REP MOVSB 107.44 B/c
REP MOVSW 106.74 B/c
REP MOVSD 107.08 B/c
REP MOVSQ 107.08 B/c
Cascade lake, server (April 2019)
REP MOVSB 58.72 B/c
REP MOVSW 58.51 B/c
REP MOVSD 58.51 B/c
REP MOVSQ 58.20 B/c
Comet Lake, desktop, workstation, mobile (August 2019)
REP MOVSB 58.72 B/c
REP MOVSW 58.62 B/c
REP MOVSD 58.72 B/c
REP MOVSQ 58.72 B/c
Ice Lake, mobile (September 2019)
REP MOVSB 102.40 B/c
REP MOVSW 101.14 B/c
REP MOVSD 101.14 B/c
REP MOVSQ 101.14 B/c
Tremont, low power (September, 2020)
REP MOVSB 119.84 B/c
REP MOVSW 121.78 B/c
REP MOVSD 121.78 B/c
REP MOVSQ 121.78 B/c
Tiger Lake, mobile (October, 2020)
REP MOVSB 93.27 B/c
REP MOVSW 93.09 B/c
REP MOVSD 93.09 B/c
REP MOVSQ 93.09 B/c
As you can see, the implementation of REP MOVS differs significantly from one microarchitecture to another. On some processors, like Ivy Bridge, REP MOVSB is the fastest variant, albeit only slightly faster than REP MOVSD/MOVSQ; but without doubt, on all processors since Nehalem, REP MOVSD/MOVSQ works very well - you don't even need "Enhanced REP MOVSB", since on Ivy Bridge (2013), which has Enhanced REP MOVSB, REP MOVSD shows the same bytes-per-clock figures as on Nehalem (2010), which does not, while REP MOVSB itself became really fast only with SkyLake (2015) - twice as fast as on Ivy Bridge. So the Enhanced REP MOVSB bit in CPUID may be confusing - it only shows that REP MOVSB per se is OK, not that any REP MOVS* is faster.
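If you want to rely only on REP MOVSQ (no ERMSB bit required), a minimal sketch could look like the following - again GNU inline assembly on x86-64, with the name `rep_movsq_copy` and the byte-tail handling being my own choices: the bulk is copied as 8-byte quadwords and the remainder as bytes.

```c
#include <stddef.h>

/* Forward copy of n bytes: bulk with REP MOVSQ, the 0..7 remaining bytes with
   REP MOVSB. Assumes non-overlapping buffers and DF = 0 (guaranteed at function
   entry by the x86-64 ABIs). */
static void rep_movsq_copy(void *dst, const void *src, size_t n)
{
    size_t qwords = n / 8;
    size_t tail   = n % 8;

    /* RDI/RSI are advanced by the first copy, so the byte tail continues
       exactly where the quadword copy stopped. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(tail)
                     :
                     : "memory");
}
```

For the L1-resident block sizes in the tables above, this is the variant that behaves consistently well from Nehalem onwards.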
The most confusing ERMSB implementation is on the Ivy Bridge microarchitecture. Yes, on very old processors, before ERMSB, REP MOVS* for large blocks did use a cache protocol feature that is not available to regular code (no-RFO), but this protocol is no longer used on Ivy Bridge, which has ERMSB. According to Andy Glew's comments on Peter Cordes's answer to "Why are complicated memcpy/memset superior?", that cache protocol feature was used on older processors but no longer on Ivy Bridge, and there also comes an explanation of why the startup costs are so high for REP MOVS*: "The large overhead for choosing and setting up the right method is mainly due to the lack of microcode branch prediction". There was also an interesting note that the Pentium Pro (P6) in 1996 implemented REP MOVS* with 64-bit microcode loads and stores and a no-RFO cache protocol - and, unlike ERMSB in Ivy Bridge, it did not violate memory ordering.