I want to know if using the MOV instruction to copy a string into a register causes the string to be stored in reverse order. I learned that when MASM stores a string into a
Don't use strings in contexts where MASM expects a 16-bit or larger integer. MASM will convert them to integers in a way that reverses the order of characters when stored in memory. Since this is confusing, it's best to avoid it and only use strings with the DB directive, which works as expected. Don't use strings with more than one character as immediate values.
Registers don't have addresses, and it's meaningless to talk about the order of bytes within a register. On a 32-bit x86 CPU, the general purpose registers like EAX hold 32-bit integer values. You can divide a 32-bit value conceptually into 4 bytes, but while it lives in a register there is no meaningful order to the bytes.
Only when a 32-bit value exists in memory do the 4 bytes that make it up have addresses, and so an order. Since x86 CPUs use little-endian byte order, the least-significant byte of the 4 is the first byte (at the lowest address) and the most-significant byte is the last. Whenever an x86 CPU loads or stores a 16-bit or wider value from or to memory, it uses little-endian byte order. (An exception is the MOVBE instruction, which specifically uses big-endian byte order when loading and storing values.)
.MODEL flat
.DATA
db_str DB "abcd"
dd_str DD "abcd"
num DD 1684234849
.CODE
_start:
mov eax, "abcd"
mov ebx, DWORD PTR [db_str]
mov ecx, DWORD PTR [dd_str]
mov edx, 1684234849
mov esi, [num]
int 3
END _start
After assembling and linking, it gets converted into a sequence of bytes, something like this:
.text section:
00401000: B8 64 63 62 61 8B 1D 00 30 40 00 8B 0D 04 30 40 ¸dcba...0@....0@
00401010: 00 BA 61 62 63 64 8B 35 08 30 40 00 CC .ºabcd.5.0@.Ì
...
.data section:
00403000: 61 62 63 64 64 63 62 61 61 62 63 64 abcddcbaabcd
(On Windows the .data section normally gets placed after the .text section in memory.)
So we can see that the DB and DD directives, the ones labelled db_str and dd_str, generate two different sequences of bytes for the same string "abcd". In the first case, MASM generates the sequence of bytes that we would expect: 61h, 62h, 63h, and 64h, the ASCII values for a, b, c, and d respectively. For dd_str, though, the sequence of bytes is reversed. This is because the DD directive takes 32-bit integers as operands, so the string has to be converted to a 32-bit value, and MASM ends up reversing the order of the characters in the string when the result of the conversion gets stored in memory.
You'll also notice that the DD directive labelled num generated the same sequence of bytes as the DB directive. Indeed, without looking at the source there's no way to tell that the first four bytes are supposed to be a string while the last four bytes are supposed to be a number. They only become strings or numbers if the program uses them that way.
(Less obvious is how the decimal value 1684234849 was converted into the same sequence of bytes as generated by the DB directive. It's already a 32-bit value; it just needs to be converted into a sequence of bytes by MASM. Unsurprisingly, the assembler does so using the same little-endian byte order that the CPU uses. That means the first byte is the least significant part of 1684234849, which happens to have the same value as the ASCII letter a (1684234849 % 256 = 97 = 61h). The last byte is the most significant part of the number, which happens to be the ASCII value of d (1684234849 / 256 / 256 / 256 = 100 = 64h, using integer division).)
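If you want to check that arithmetic for all four bytes, here is a small Python sketch that peels the bytes off the number the same way, least significant first. The variable names are just for illustration.

```python
value = 1684234849

# Take the least significant byte first, which is the order a
# little-endian store writes them to memory.
little_endian = []
remaining = value
for _ in range(4):
    little_endian.append(remaining % 256)
    remaining //= 256

print([f"{b:02X}h" for b in little_endian])  # ['61h', '62h', '63h', '64h']
print(bytes(little_endian))                  # b'abcd'
```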
Looking at the values in the .text section more closely with a disassembler, we can see how the sequence of bytes stored there will be interpreted as instructions when executed by the CPU:
00401000: B8 64 63 62 61 mov eax,61626364h
00401005: 8B 1D 00 30 40 00 mov ebx,dword ptr ds:[00403000h]
0040100B: 8B 0D 04 30 40 00 mov ecx,dword ptr ds:[00403004h]
00401011: BA 61 62 63 64 mov edx,64636261h
00401016: 8B 35 08 30 40 00 mov esi,dword ptr ds:[00403008h]
0040101C: CC int 3
What we can see here is that MASM stored the bytes that make up the immediate value in the instruction mov eax, "abcd" in the same order it did with the dd_str DD directive. The first byte of the immediate part of the instruction in memory is 64h, the ASCII value of d. The reason is that with a 32-bit destination register this MOV instruction uses a 32-bit immediate. That means MASM needs to convert the string to a 32-bit integer, and it ends up reversing the order of bytes as it did with dd_str. MASM also handles the decimal number given as the immediate to mov edx, 1684234849 the same way it did with the DD directive that used the same number. The 32-bit value was converted to the same little-endian representation.
You'll also notice that the disassembler generated assembly instructions that use hexadecimal values for the immediates of these two instructions. Like the CPU, the disassembler has no way of knowing that the immediate values are supposed to be strings and decimal numbers. They're just a sequence of bytes in the program; all it knows is that they're 32-bit immediate values (from the opcodes B8h and BAh), and so it displays them as 32-bit hexadecimal values for lack of any better alternative.
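To make the disassembler's point of view concrete, here is a rough Python sketch of decoding just this one instruction form (MOV r32, imm32, opcodes B8h through BFh). It assumes 32-bit code with no prefixes, and decode_mov_imm32 is a made-up helper, not part of any real disassembler.

```python
import struct

# Register numbers as encoded in the low 3 bits of the B8h..BFh opcodes.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_imm32(code: bytes) -> str:
    """Decode a 5-byte 'MOV r32, imm32' instruction."""
    if len(code) < 5 or not 0xB8 <= code[0] <= 0xBF:
        raise ValueError("not a MOV r32, imm32 instruction")
    # The four bytes after the opcode are a little-endian 32-bit integer.
    (imm,) = struct.unpack_from("<I", code, 1)
    return f"mov {REG32[code[0] - 0xB8]},{imm:08X}h"


print(decode_mov_imm32(bytes.fromhex("B8 64 63 62 61")))  # mov eax,61626364h
print(decode_mov_imm32(bytes.fromhex("BA 61 62 63 64")))  # mov edx,64636261h
```

Nothing in the bytes says "string" or "decimal", so hexadecimal is all the decoder can print.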
By executing the program under a debugger and inspecting the registers after it reaches the breakpoint instruction (int 3) we can see what actually ended up in the registers:
eax=61626364 ebx=64636261 ecx=61626364 edx=64636261 esi=64636261 edi=00000000
eip=0040101c esp=0018ff8c ebp=0018ff94 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
image00000000_00400000+0x101c:
0040101c cc int 3
Now we can see that the first and third instructions loaded a different value than the other instructions. These two instructions both involve cases where MASM converted the string to a 32-bit value and ended up reversing the order of the characters in memory. The register dump confirms that the reversed order of bytes in memory results in different values being loaded into the registers.
Now you might be looking at that register dump above and thinking that only EAX and ECX are in the correct order, with the ASCII value for a, 61h, first and the ASCII value for d, 64h, last, and that MASM reversing the order of the strings in memory actually caused them to be loaded into registers in the correct order. But as I said before, there's no byte order in registers. The number 61626364 is just how the debugger represents the value when displaying it as a sequence of characters you can read. The characters 61 come first in the debugger's representation because our numbering system puts the most significant part of the number on the left, and we read left-to-right, so that makes it the first part. However, as I also said before, x86 CPUs are little-endian, which means the least significant part comes first in memory. That means the first byte in memory becomes the least significant part of the value in the register, which the debugger displays as the rightmost two hexadecimal digits of the number, because that's where the least significant part of the number goes in our numbering system.
In other words, because x86 CPUs are little-endian (least significant first) but our numbering system is big-endian (most significant first), hexadecimal numbers get displayed in a byte-wise reverse order to how they're actually stored in memory.
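Here is the same point as a tiny Python sketch, using the four bytes of dd_str from the dump. The variable names are mine; the only thing being modelled is a 32-bit little-endian load followed by the usual way of printing a number.

```python
# dd_str as it sits in memory, lowest address first.
memory = bytes([0x64, 0x63, 0x62, 0x61])

# A 32-bit little-endian load: the first byte becomes the least significant.
ecx = int.from_bytes(memory, "little")

print(memory.hex(" "))  # 64 63 62 61  (memory dump order)
print(f"{ecx:08x}")     # 61626364     (how a debugger prints the register)
```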
It should also hopefully be clear by now that loading a string into a register is only something that happens conceptually. The string gets converted into a sequence of bytes by the assembler, and when those bytes are loaded into a 32-bit register they are treated as a little-endian 32-bit integer in memory. When the 32-bit value in the register is stored to memory, it is converted back into a sequence of bytes that represent the value in little-endian format. To the CPU your string is just a 32-bit integer that it loads from and stores to memory.
So that means that if the value loaded into EAX in the sample program is stored to memory with something like mov [mem], eax, then the 4 bytes stored at mem will be in the same order as they appeared in the bytes that made up the immediate of mov eax, "abcd". That is, in the same reversed order, 64h, 63h, 62h, 61h, that MASM put them in the bytes that make up the immediate.
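A quick way to convince yourself of that round trip is to model it with Python's struct module. This is only a sketch: mem here is a hypothetical 4-byte buffer standing in for the memory operand.

```python
import struct

# Immediate bytes of mov eax, "abcd" as MASM encoded them (after opcode B8h).
immediate = bytes.fromhex("64 63 62 61")

# Executing the MOV: the CPU reads the immediate as a little-endian integer.
(eax,) = struct.unpack("<I", immediate)

# mov [mem], eax: the CPU writes the integer back out in little-endian order.
mem = bytearray(4)
struct.pack_into("<I", mem, 0, eax)

print(f"{eax:08X}")  # 61626364
print(mem.hex(" "))  # 64 63 62 61
```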
Now as to why MASM reverses the order of strings when converting them to 32-bit integers, I don't know, but the moral here is not to use strings as immediates or in any other context where they need to be converted to integers. Assemblers are inconsistent in how they convert string literals into integers. (A similar problem occurs in how C compilers convert character literals like 'abcd' into integers.)
Nothing special happens with the SCASD or MOVSD instructions. SCASD treats the four bytes pointed to by EDI as a 32-bit little-endian value, loads it into an unnamed temporary register, compares the temporary register to EAX, and then adds or subtracts 4 from EDI depending on the DF flag. MOVSD loads a 32-bit value in memory pointed to by ESI into an unnamed temporary register, stores the temporary register to the 32-bit memory location pointed to by EDI, and then updates ESI and EDI according to the DF flag. (Byte order doesn't matter for MOVSD as the bytes are never used as a 32-bit value, but the order isn't changed.)
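The two descriptions above translate almost line for line into a toy model. The Python class below is a sketch under simplifying assumptions: flat memory as a bytearray, no segments, and only ZF modelled out of the flags that SCASD sets. The class and its field names are invented for illustration.

```python
import struct


class Cpu:
    """Toy model holding only the state SCASD and MOVSD touch."""

    def __init__(self, memory: bytearray):
        self.mem = memory
        self.eax = self.esi = self.edi = 0
        self.df = 0  # direction flag: 0 = ascending, 1 = descending
        self.zf = 0  # zero flag, set when the comparison finds equality

    def _step(self) -> int:
        return -4 if self.df else 4

    def scasd(self) -> None:
        # Load the dword at [EDI] as a little-endian integer and compare.
        (temp,) = struct.unpack_from("<I", self.mem, self.edi)
        self.zf = int(self.eax == temp)
        self.edi += self._step()

    def movsd(self) -> None:
        # Copy four bytes from [ESI] to [EDI]; their order is unchanged.
        self.mem[self.edi:self.edi + 4] = self.mem[self.esi:self.esi + 4]
        self.esi += self._step()
        self.edi += self._step()


cpu = Cpu(bytearray(b"dcba" + bytes(4)))
cpu.eax = 0x61626364       # what mov eax, "abcd" leaves in EAX under MASM
cpu.scasd()
print(cpu.zf, cpu.edi)     # 1 4

cpu.esi, cpu.edi = 0, 4
cpu.movsd()
print(bytes(cpu.mem))      # b'dcbadcba'
```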
I wouldn't try to think of SCASD or MOVSD as FIFO or LIFO because ultimately that depends on how you use them. MOVSD can just as easily be used as part of an implementation of a FIFO queue as of a LIFO stack. (Compare this to PUSH and POP, which in theory could independently be used as part of an implementation of either a FIFO or LIFO data structure, but together can only be used to implement a LIFO stack.)
See @RossRidge's answer for a very detailed description of how MASM works. This answer compares it to NASM which might just be confusing if you only care about MASM.
mov ecx, 4 is four dwords = 16 bytes when used with repne scasd.
Simpler would be to omit rep and just use scasd.
Or even simpler, cmp dword ptr [strLetters], "dcba".
If you look at the immediate in the machine code, it will compare equal if it's in the same order in memory as the data, because both are treated as little-endian 32-bit integers. (Because x86 instruction encoding uses little-endian immediates, matching x86's data load/store endianness.)
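To see why ECX counts dwords rather than bytes, and why the plain cmp is enough, here is a toy Python model of repne scasd with DF=0. The 16-byte buffer is hypothetical (I don't know what follows strLetters in your program), and the model only tracks ZF, EDI and ECX.

```python
import struct


def repne_scasd(mem: bytes, edi: int, eax: int, ecx: int):
    """Toy REPNE SCASD with DF=0: scan up to ECX dwords for EAX.

    Returns (zf, edi, ecx) as left after the instruction. The real CPU
    leaves ZF unchanged when ECX starts at 0; this model reports 0.
    """
    zf = 0
    while ecx != 0:
        (temp,) = struct.unpack_from("<I", mem, edi)
        edi += 4
        ecx -= 1
        zf = int(temp == eax)
        if zf:  # REPNE stops as soon as a match sets ZF
            break
    return zf, edi, ecx


# Hypothetical buffer: three zero dwords, then the letters.
buf = bytes(12) + b"abcd"
eax = int.from_bytes(b"abcd", "little")  # the value MASM's "dcba" encodes

print(repne_scasd(buf, 0, eax, 4))  # (1, 16, 0)
print(repne_scasd(buf, 0, eax, 1))  # (0, 4, 0)

# The cmp dword ptr [mem], imm32 alternative is a single dword comparison.
print(int.from_bytes(buf[12:16], "little") == eax)  # True
```

With ECX = 4 the scan walks all 16 bytes before it finds the match, which is more than a single 4-byte comparison needs.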
And yes, for MASM apparently you do need "dcba" to get the desired byte order when using a string as an integer constant, because MASM treats the first character as "most significant" and puts it last in a 32-bit immediate.
NASM and MASM are very different here. In NASM, mov dword [mem], 'abcd' produces 'a', 'b', 'c', 'd' in memory, i.e. byte-at-a-time memory order matches source order. See NASM character constants. Multi-character constants are simply right-justified in a 32-bit little-endian immediate with the string bytes in source order.
e.g.
objdump -d -Mintel disassembly
c7 07 61 62 63 64 mov DWORD PTR [rdi], 0x64636261
NASM source: mov dword [rdi], "abcd"
MASM source: mov dword ptr [rdi], "dcba"
GAS source: AFAIK not possible with a multi-char string literal. You could do something like $'a' + ('b'<<8) + ...
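The two conventions described above can be compared side by side with a short Python sketch. The function names are made up, and the MASM rule is just the behaviour described in this answer, not something taken from MASM's documentation.

```python
def nasm_char_const(text: str) -> int:
    """NASM-style: the first character is the least significant byte."""
    return int.from_bytes(text.encode("ascii"), "little")


def masm_string_const(text: str) -> int:
    """MASM-style (as described above): first character is most significant."""
    return int.from_bytes(text.encode("ascii"), "big")


def imm32_bytes(value: int) -> bytes:
    """x86 encodes a 32-bit immediate in little-endian order."""
    return value.to_bytes(4, "little")


print(imm32_bytes(nasm_char_const("abcd")))    # b'abcd'
print(imm32_bytes(masm_string_const("dcba")))  # b'abcd'
print(imm32_bytes(masm_string_const("abcd")))  # b'dcba'
print(hex(nasm_char_const("ab")))              # 0x6261
```

The first two lines are the NASM and MASM spellings that produce the same c7 07 61 62 63 64 encoding.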
I agree with Ross's suggestion to avoid multi-character string literals in MASM except as an operand to db. If you want nice sane multi-character literals as immediates, use NASM or EuroAssembler (https://euroassembler.eu/eadoc/#CharNumbers).
Also, don't use both jcc and jmp; a single je close is enough, either taken or falling through.
(You did avoid the usual brain-dead idiom of jcc over a jmp; here your jz is sane and the jmp is totally redundant, jumping to the next instruction.)