Difference between db and dw when defining strings

Question

In NASM assembly, the db and dw pseudo-instructions declare data. The NASM manual provides a couple of examples but doesn't say directly what the difference between them is. I tried the following "hello world" code with both, and no difference was observable. I suspect the distinction has something to do with the internal data format, but I don't know how to inspect that.

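section .data
        msg db "hello world",10,13,0        ; 11 characters, then LF, CR, NUL as bytes
        msg2 dw "hello world",10,13,0       ; same source text, declared as words

section .text
global _start
_start:
        mov rax, 1              ; sys_write
        mov rdi, 1              ; fd 1 = stdout
        mov rsi, msg ; or use msg2
        mov rdx, 14             ; number of bytes to write
        syscall
        jmp .exit

.exit:
        mov rax, 60             ; sys_exit
        mov rdi, 0              ; exit status 0
        syscall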

Answer 1:

NASM produces WORDs anyhow ;-)

dw 'a' is equivalent to dw 0x61 and stores the WORD 0x0061 (big-endian) as 61 00 (little-endian).
dw 'ab' (little-endian) is equivalent to dw 0x6261 (big-endian) and stores 61 62 (little-endian).
dw 'abc' (one word, one byte) is equivalent to dw 0x6261, 0x63 and stores two WORDS (little-endian): 61 62 63 00.
dw 'abcd' (two words) stores two WORDs: 61 62 63 64.

msg2 dw "hello world",10,13,0 converts the string into 6 words and the numbers into 3 words, and stores them as 18 bytes: 68 65 6C 6C 6F 20 77 6F 72 6C 64 00 0A 00 0D 00 00 00. In your example, rdx is 14, so msg2 is not printed to its end: only the first 14 of those bytes are written.
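
As a rough sketch of that layout, the same data can be spelled out one byte at a time with db. The labels msg2_bytes and msg2_len are hypothetical names introduced here for illustration; they are not part of the question's code. Computing the length with equ $ - label is one common way to avoid hard-coding a byte count such as 14.

```nasm
section .data
        ; The declaration from the question: 9 words = 18 bytes.
        msg2       dw "hello world",10,13,0

        ; The same 18 bytes written explicitly (hypothetical label).
        msg2_bytes db 'h','e','l','l','o',' ','w','o','r','l','d',0  ; 11 characters + 1 padding byte = 6 words
                   db 10, 0                                          ; dw 10 -> 0A 00
                   db 13, 0                                          ; dw 13 -> 0D 00
                   db 0, 0                                           ; dw 0  -> 00 00

        ; Assemble-time length of the explicit copy: 18.
        msg2_len   equ $ - msg2_bytes
```

Loading msg2_len into rdx instead of the literal 14 would write all 18 bytes, including the embedded zero bytes.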

Answer 2:

The NASM manual sections 3.2.1 DB and Friends: Declaring Initialized Data and 3.4.2 Character Strings indicate that there is a difference when a string's length is not a multiple of the element size: the final element is padded with zero bytes to its native size.

To ensure that you do not have unintended characters in the data, always use DB for 8-bit strings. DW may or may not work for UTF-16 depending on the machine byte order and any assumptions in the code.

Using DW will also produce unexpected results for the numeric values: each one is emitted as a 16-bit word, which introduces unexpected null characters into the string (10 becomes 0A 00, for example).

Use the listing file (section 2.1.3, The -l Option: Generating a Listing File) to see the actual bytes being generated.
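
A minimal sketch of how that might look in practice. The file name listing_demo.asm and both labels are made up for this example; the elf64 output format is an assumption based on the question's 64-bit Linux code.

```nasm
; listing_demo.asm (hypothetical file name)
;
; Assemble with a listing file, for example:
;   nasm -f elf64 -l listing_demo.lst listing_demo.asm
; Then open listing_demo.lst: each source line is shown next to the
; bytes NASM generated for it.

section .data
        as_bytes db "hi!",10    ; generates 68 69 21 0A
        as_words dw "hi!",10    ; generates 68 69 21 00 0A 00
```

Comparing the two lines in the listing shows both effects at once: the zero byte that pads the 3-character string to 2 words, and the extra zero byte after the 10.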

Answer 3:

NASM's db, dw, dd, etc. accept a list of integers, and encode them into the output as little-endian, e.g. dw 0x1234, 0x5678 assembles to 34 12 78 56.

NASM also supports multi-character character literals, like 'ab', in any context where it accepts an integer, e.g. add ax, '00' is the same as add ax, 0x3030 (maybe useful for unpacked-BCD->ASCII conversion).

NASM's byte ordering for multi-character literals produces the same order in memory as the source order on little-endian x86. So for example, mov eax, '1234' / mov [buf], eax will produce the same 4-byte sequence in memory as buf: db '1', '2', '3', '4'. The mov-immediate instruction is encoded as b8 31 32 33 34 because x86 immediate operands use little-endian, just like data loads/stores.
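
The following is a small sketch of that claim as a complete program, assuming x86-64 Linux as in the question. The file name in the build comment is hypothetical, and RIP-relative addressing is used here only as a conservative choice.

```nasm
; Build, for example, with:
;   nasm -f elf64 literal.asm && ld -o literal literal.o

section .bss
        buf resb 4              ; 4 bytes of scratch space

section .text
global _start
_start:
        mov eax, '1234'         ; immediate is encoded as 31 32 33 34
        mov [rel buf], eax      ; buf now holds '1','2','3','4' in source order

        mov eax, 1              ; sys_write
        mov edi, 1              ; fd 1 = stdout
        lea rsi, [rel buf]      ; address of the 4 bytes
        mov edx, 4              ; length
        syscall

        mov eax, 60             ; sys_exit
        xor edi, edi            ; exit status 0
        syscall
```

Running it writes the four characters 1234 to stdout, in the same order as they appear in the source.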

There is a special case for string arguments to db/dw/dd/etc.: the ASCII or UTF-8 string is treated as multiple elements. In other contexts the value is truncated to keep only the low byte / word / dword of the integer, as with add ax, '123456' (foo.asm:1: warning: word data exceeds bounds [-w+number-overflow]).

But the last element is padded with zeros (at the end because little-endian) to make the total size a multiple of the element size (word for dw / dword for dd / etc.)

So all of these are exactly equivalent:

 db 0x61, 0x62,   0x63, 0x64,   0x65, 0x66,   0x67, 0x00
 db 'a', 'b', 'c', 'd', 'e', 'f', 'g', 0
 db 'abcdefg', 0
 db `abcdefg\0`                  ; C-style escapes like \n or \0 work inside backtick strings only.  (NASM only, not YASM)

 dw 'abcdefg'                    ; 7 bytes padded to 4 words = 8 bytes
 dw 'ab', 'cd', 'ef', 'g'        ; 'g' is a small WORD value
 dw 0x6261, 0x6463, 0x6665, 0x0067   ; x86 is little-endian (LSB first) but we write integer values with MSD on the left

 dd 'abcdefg'                    ; 7 bytes padded to 2 dwords = 8 bytes

With dd and dq, (and do and other wider types), you can have more than 1 byte of zeros, but always at the end. e.g. dd 'abcde' is db 'abcde', 0,0,0.

See the NASM manual, 3.2.1 DB and Friends: Declaring Initialized Data and 3.4.2 Character Strings.

Source values, and integer values like 0x123456, don't have an "endianness". That's a misuse of the terminology.

Endianness is the effect you see when serializing a multi-byte integer into memory and then examining the individual bytes in order of increasing address.

Our convention of writing Arabic numerals left to right, from most-significant digit (MSD) to least-significant digit (LSD), is a separate thing from byte order in memory. It's also mostly arbitrary that we usually draw diagrams of memory with addresses increasing from left to right, and that our way of writing numbers happens to look like big-endian.

We could just as well use Roman numerals to represent numbers in source code and/or pseudocode, or any other counting system. (e.g. unary, where 3 = 111).

It's not a bad mental shortcut, though, to remember that "little endian makes numbers backwards". But it's on byte boundaries, not hex-digit (4-bit) boundaries.
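
A short sketch of the byte-boundary point; the labels value and same are hypothetical names used only for this example.

```nasm
section .data
        value dd 0x12345678             ; stored as 78 56 34 12
        same  db 0x78, 0x56, 0x34, 0x12 ; identical bytes, written one at a time

        ; Each byte keeps its two hex digits in their written order:
        ; the result is 78 56 34 12, not 87 65 43 21.
```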

But don't fall into the trap of thinking that values in registers are "big endian". They don't have an endianness except when stored in memory.

Source: https://stackoverflow.com/questions/28076943/difference-between-db-and-dw-when-defining-strings

Tags

assembly

nasm

x86-64