Assembly: REP MOVS mechanism

春和景丽 2021-02-20 08:03

Looking at the following assembly code:

MOV ESI, DWORD PTR [EBP + C]
MOV ECX, EDI
MOV EAX, EAX
SHR ECX, 2
LEA EDI, DWORD PTR[EBX + 18]
REP MOVS DWORD PTR ES:[E         


        
2 Answers
  •  野性不改
    2021-02-20 08:23

    The short explanation about syntax

    At the assembly-code level, two forms of this instruction are allowed: the “explicit-operands” form and the “no-operands” form. The explicit-operands form lets the source and destination memory addresses be written out symbolically. It exists for documentation purposes only, and that documentation can be misleading: the symbols do not have to name the correct source and destination addresses. The source address is always DS:(RSI/ESI/SI) and the destination address is always ES:(RDI/EDI/DI), and these registers must be loaded correctly before the MOVS instruction executes. This is how I understand Intel's official position on the issue.

    The long explanation about syntax

    REP MOVS DWORD PTR ES:[EDI], DWORD PTR [ESI] is a synonym for REP MOVSD, and REP MOVS BYTE PTR ES:[EDI], BYTE PTR [ESI] is a synonym for REP MOVSB.

    The MOVS instruction comes in the following variants, by data size:

    • MOVSB (byte, 8-bit)
    • MOVSW (word, 16-bit)
    • MOVSD (dword, 32-bit)
    • MOVSQ (qword, 64-bit), only available in 64-bit mode

    The MOVS instruction copies one element from DS:(SI/ESI/RSI) to ES:(DI/EDI/RDI); the register width depends on the current mode (16-bit, 32-bit or 64-bit). It then advances both SI and DI by the element size: upward if the direction flag (DF) is clear, downward if it is set. Use CLD to clear DF so that the registers increase.
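
    The single-step behaviour described above can be sketched in C. This is only a model under simplifying assumptions: memory is one flat byte array, segments (DS/ES) are ignored, and the names MovsState and movs_step are made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model state, not a real API: indices are offsets into mem. */
typedef struct {
    size_t si;  /* stands in for SI/ESI/RSI */
    size_t di;  /* stands in for DI/EDI/RDI */
    int    df;  /* direction flag: 0 after CLD, 1 after STD */
} MovsState;

/* One MOVS step: copy elem bytes (1, 2, 4 or 8), then move both indices. */
static void movs_step(uint8_t *mem, MovsState *s, size_t elem)
{
    memmove(mem + s->di, mem + s->si, elem);
    if (s->df) {            /* DF = 1: indices move down */
        s->si -= elem;
        s->di -= elem;
    } else {                /* DF = 0: indices move up */
        s->si += elem;
        s->di += elem;
    }
}
```

    Calling movs_step with elem = 4 corresponds to one MOVSD, and with elem = 1 to one MOVSB.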

    The MOVS instruction cannot use any registers other than SI and DI, so it is not necessary to specify them.

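    If the MOVS instruction is prefixed by REP, it is repeated CX (ECX/RCX) times, decrementing the count register after each element, so the register is zero at the end. The count is in elements, not bytes: REP MOVSD with ECX = 4 copies 16 bytes. That is presumably why the code in the question shifts ECX right by 2 (dividing it by 4) before REP MOVS DWORD.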
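
    To make the element count concrete, here is a rough C model of a forward copy done as REP MOVSD followed by REP MOVSB. The leftover-bytes step (count AND 3) is an assumption about the usual idiom, since the listing in the question is cut off before that point; CopySplit and copy_dwords_then_bytes are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: how a byte count is split for a dword copy. */
typedef struct {
    size_t dwords;  /* count for REP MOVSD (byte count >> 2, as in SHR ECX, 2) */
    size_t bytes;   /* leftover 0..3 bytes, assumed to be copied by REP MOVSB */
} CopySplit;

/* Forward copy (DF = 0) modeled as REP MOVSD followed by REP MOVSB. */
static CopySplit copy_dwords_then_bytes(uint8_t *dst, const uint8_t *src,
                                        size_t nbytes)
{
    CopySplit split = { nbytes >> 2, nbytes & 3 };
    size_t cx;

    /* REP MOVSD: cx counts 4-byte elements, not bytes. */
    for (cx = split.dwords; cx != 0; cx--) {
        memmove(dst, src, 4);
        src += 4;
        dst += 4;
    }
    /* REP MOVSB: cx counts the remaining single bytes. */
    for (cx = split.bytes; cx != 0; cx--) {
        *dst++ = *src++;
    }
    return split;
}
```

    For an 11-byte copy the model runs two dword iterations and then three byte iterations.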

    The explanation on relative performance

    Starting with the first Pentium CPU in 1993, Intel made simple instructions execute faster and complex instructions (like REP MOVS) slower. REP MOVS therefore became very slow, and there was no longer any reason to use it on Pentium CPUs based on the P5 microarchitecture (1993-1997).

    In parallel with the P5 microarchitecture, Intel developed the P6 microarchitecture, where it revisited REP MOVS and, starting in 1996, implemented the "fast strings" feature, which made REP MOVS fast again.

    In 2013, Intel revisited REP MOVS once more and added the CPUID ERMSB (Enhanced REP MOVSB) bit, which is supposed to indicate that the CPU implements the byte-sized move and store string instructions (MOVSB, STOSB) in a fast and efficient manner. In practice, it is only fast for large blocks, 256 bytes and larger, and only when certain conditions are met:

    • both the source and destination addresses have to be aligned to a 16-byte boundary (this boundary size is recommended for Ivy Bridge processors; on newer processors the boundary may be larger, up to 64 bytes for Cannonlake);
    • the source region should not overlap with the destination region;
    • the length has to be a multiple of 64 bytes to produce higher performance;
    • the direction has to be forward (DF clear, i.e. after CLD).
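
    The conditions in the list above could be checked with a small predicate like the following. It is a sketch only: the constants are the figures quoted in this answer (real thresholds vary by microarchitecture), a forward copy is assumed, and ermsb_friendly is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Figures quoted in the answer; real thresholds vary by microarchitecture. */
#define ERMSB_MIN_LEN   256u  /* "256 bytes and larger" */
#define ERMSB_ALIGN     16u   /* Ivy Bridge figure; newer CPUs may want up to 64 */
#define ERMSB_LEN_MULT  64u   /* length multiple for higher performance */

/* Hypothetical predicate, assuming a forward copy (DF clear). */
static bool ermsb_friendly(const void *dst, const void *src, size_t len)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    bool big_enough = len >= ERMSB_MIN_LEN;
    bool aligned    = (d % ERMSB_ALIGN == 0) && (s % ERMSB_ALIGN == 0);
    bool multiple   = (len % ERMSB_LEN_MULT) == 0;
    /* The regions [s, s + len) and [d, d + len) must not overlap. */
    bool disjoint   = (d >= s) ? (d - s >= len) : (s - d >= len);

    return big_enough && aligned && multiple && disjoint;
}
```

    A copy routine could use such a check to choose between REP MOVSB and another strategy, although where the break-even point really lies has to be measured on the target CPU.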

    See the Intel Optimization Manual, section 3.7.6, "Enhanced REP MOVSB and STOSB operation (ERMSB)": http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
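
    As an illustration of how the ERMSB feature bit can be queried, here is a minimal sketch. It assumes a reasonably recent GCC or Clang on x86, whose <cpuid.h> provides __get_cpuid_count; on other compilers or architectures it simply reports 0. The bit is CPUID.(EAX=07H, ECX=0):EBX, bit 9. HAVE_CPUID_H, ebx_has_ermsb and cpu_has_ermsb are names invented for this example.

```c
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID_H 1   /* hypothetical macro, local to this example */
#endif

/* ERMSB is reported in CPUID.(EAX=07H, ECX=0):EBX, bit 9. */
static int ebx_has_ermsb(uint32_t leaf7_ebx)
{
    return (int)((leaf7_ebx >> 9) & 1u);
}

/* Returns 1 if the running CPU reports ERMSB, 0 otherwise or if unknown. */
static int cpu_has_ermsb(void)
{
#ifdef HAVE_CPUID_H
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* __get_cpuid_count returns 0 when leaf 7 is not supported. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return ebx_has_ermsb(ebx);
#endif
    return 0;
}
```

    The bit only says that the CPU advertises the feature; whether REP MOVSB actually wins for a given block size still depends on the conditions listed above.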

    REP MOVS instructions are very slow on small blocks because the startup cost is about 35 cycles. A plain MOV-based copy loop (load into EAX or a similar register, then store) has no such startup cost and can copy a lot of data in those 35 cycles.

    Please note that ERMSB produces the best results for REP MOVSB, not REP MOVSD (or MOVSQ). All REP MOVS instructions became significantly faster, but with ERMSB, REP MOVSB is the fastest of all. This is in contrast to older processors (before 2013), where the largest available MOVS size (MOVSQ in 64-bit mode, MOVSD in 32-bit mode) produced the fastest result.

    So the code that you have shown is not optimal for processors with ERMSB, because there REP MOVSB is the fastest form, not REP MOVSD, although the difference is not that big. A single REP MOVSB should be enough: it incurs the startup cost only once, rather than twice (first for REP MOVSD and then for REP MOVSB).

    However, for processors without ERMSB, your code is OK, except on P5-based Pentium processors (released in 1993), where a plain MOV-based copy loop (or one using the larger x87 registers) would be faster. The code that you have given will also give the best results on very old processors such as the 80386, released in 1985.
