Question
I have to write assembly code which copies 100 bytes in memory in a loop. I wrote it like this:
section .data
a times 100 db 1 ;reserve 100 bytes and fill with 1
b times 100 db 0 ;reserve 100 bytes and fill with 0
section _start
global _start
_start:
mov rsi, a ;get array a address
mov rdi, b ;get array b address
_for: ;start of the loop
cmp cx, 100 ;loop
jae _end_for ;loop
push cx ;loop
mov byte al, [rsi] ;get one byte from array a into al
mov byte [rdi], al ;put one byte from al to array b
inc rsi ;set rsi to next byte in array a
inc rdi ;set rdi to next byte in array b
pop cx ;loop
inc cx ;loop
jmp _for ;loop
_end_for:
_end:
mov rax, 60
mov rdi, 0
syscall
I'm not sure about the copying part. I read the value from one address into a register and then store it at the other. That looks good to me, but I'm not sure about incrementing rsi and rdi. Is it really enough?
I'm new to NASM and assembly, so please help :-)
Answer 1:
I know about rep movsb, but the task was to do it in a loop, byte after byte. I don't know if it could be done a better way.
If you have to loop 1 byte at a time, here's how to do that efficiently. It's worth mentioning because looping efficiently is useful for cases other than memcpy as well!
First of all, you know that your loop body should run at least once, so you can use a normal loop structure with a conditional branch at the bottom. (Why are loops always compiled into "do...while" style (tail jump)?)
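In C terms (an illustrative sketch, not code from the original answer), a copy loop whose body always runs at least once is naturally a do...while with the test at the bottom, which is the shape the assembly below uses:

```c
#include <assert.h>
#include <stddef.h>

// Copy n bytes (requires n >= 1): exactly one conditional branch per
// iteration, at the bottom of the loop -- the "do...while" loop shape.
static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n) {
    size_t i = 0;
    do {
        dst[i] = src[i];
        i++;
    } while (i != n);
}
```

A compiler lowers this to a body followed by a single conditional jump back to the top, with no separate test before the first iteration.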
Second, if you're not going to unroll at all then you should use an indexed addressing mode to avoid having to increment both pointers. (But really it would be better to unroll).
And don't use 16-bit registers if you don't have to. Prefer 32-bit operand-size (ECX); writing a 32-bit register implicitly zero-extends to 64-bit so it's safe to use an index as part of an addressing mode.
You can use an indexed load but a non-indexed store so your store-address uops can still run on port7, making this slightly more hyperthreading-friendly on Haswell/Skylake. And avoiding un-lamination on Sandybridge. Obviously copying 1 byte at a time is total garbage for performance, but sometimes you do want to loop and actually do something with each byte while it's in a register, and you can't manually vectorize it with SSE2 (to do 16 bytes at a time).
You can do this by indexing the src relative to the dst.
Or the other trick is to count a negative index up towards zero, so you avoid an extra cmp. Let's do that first:
default rel ; use RIP-relative addressing modes by default
ARR_SIZE equ 100
section .data
a: times ARR_SIZE db 1
section .bss
b: resb ARR_SIZE ;reserve n bytes of space in the BSS
;section _start ; do *not* use custom section names unless you have a good reason
; they might get linked with unexpected read/write/exec permission
section .text
global _start
_start:
lea rsi, [a+ARR_SIZE] ; pointers to one-past-the-end of the arrays
lea rdi, [b+ARR_SIZE] ; RIP-relative LEA is better than mov r64, imm64
mov rcx, -ARR_SIZE
.copy_loop: ; do {
movzx eax, byte [rsi+rcx] ; load without a false dependency on the old value of RAX
mov [rdi+rcx], al
inc rcx
jnz .copy_loop ; }while(++idx != 0);
.end:
mov eax, 60
xor edi, edi
syscall ; sys_exit(0)
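The same negative-index trick, sketched in C (the helper name is mine, not from the original): index both arrays from their one-past-the-end pointers and count a negative index up toward zero, so the increment itself produces the flags the branch tests:

```c
#include <assert.h>
#include <stddef.h>

// Copy n bytes (requires n >= 1) by counting a negative index up to zero:
// the loop condition falls out of the increment, no separate compare needed.
static void copy_neg_index(unsigned char *dst, const unsigned char *src, size_t n) {
    unsigned char *dend = dst + n;            // one-past-the-end pointers,
    const unsigned char *send = src + n;      // like lea r, [arr+ARR_SIZE]
    ptrdiff_t i = -(ptrdiff_t)n;              // like mov rcx, -ARR_SIZE
    do {
        dend[i] = send[i];                    // [rsi+rcx] / [rdi+rcx]
    } while (++i != 0);                       // inc rcx / jnz
}
```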
In position-dependent code like a static (or other non-PIE) Linux executable, mov edi, b+ARR_SIZE is the most efficient way to put a static address into a register.
Don't use _ for all your label names. _start is named that way because C symbol names that begin with _ are reserved for use by the implementation. It's not something you should copy; in fact the opposite is true.
Use .foo for a local label name inside a function; e.g. .foo: is shorthand for _start.foo: if you use it after _start.
Indexing src relative to dst:
Normally your input and output aren't both in static storage, so you have to sub the addresses at runtime. Here, if we put them both in the same section like you were originally doing, mov rcx, a-b will actually assemble. But if not, NASM refuses.
In fact, instead of a 2-register addressing mode, I could just be using [rdi + (a-b)], or simply [rdi - ARR_SIZE], because I know they're contiguous.
_start:
lea rdi, [b] ; RIP-relative LEA is better than mov r64, imm64
mov rcx, a-b ; distance between arrays so [rdi+rcx] = [a]
;;; for a-b to assemble, I had to move b back to the .data section.
lea rdx, [rdi+ARR_SIZE] ; end_dst pointer
.copy_loop: ; do {
movzx eax, byte [rdi + rcx] ; src = dst+(src-dst)
mov [rdi], al
inc rdi
cmp rdi, rdx
jb .copy_loop ; }while(dst < end_dst);
An end-of-the-array pointer is exactly like you'd use in C++ with foo.end() to get a pointer / iterator to one-past-the-end.
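A C sketch of this dst-relative pattern (illustrative names; pointer subtraction between unrelated objects is formally out of spec in ISO C but models the flat-memory assembly exactly):

```c
#include <assert.h>
#include <stddef.h>

// Copy n bytes (requires n >= 1), walking dst up to a one-past-the-end
// pointer and loading src via the constant offset (src - dst),
// mirroring movzx eax, byte [rdi + rcx] in the assembly.
static void copy_to_end(unsigned char *dst, const unsigned char *src, size_t n) {
    unsigned char *end_dst = dst + n;   // like lea rdx, [rdi+ARR_SIZE]
    ptrdiff_t off = src - dst;          // like mov rcx, a-b
    do {
        *dst = dst[off];                // src = dst + (src - dst)
        dst++;
    } while (dst < end_dst);            // inc rdi / cmp rdi, rdx / jb
}
```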
This needs INC + CMP/JCC as loop overhead. On AMD CPUs, CMP/JCC can macro-fuse into 1 uop but INC/JCC can't, so the extra CMP vs. indexing from the end is basically free (except for code size).
On Intel this avoids an indexed store. The load is a pure load in this case, so it's a single uop anyway without needing to stay micro-fused with an ALU uop. Intel can macro-fuse inc/jcc, so this does cost an extra uop of loop overhead.
This way of looping is good if you're unrolling, if you don't need to avoid an indexed addressing mode for loads. But if you're using a memory source for an ALU instruction like vaddps ymm0, ymm1, [rdi], then yes, you should increment both pointers separately so you can use non-indexed addressing modes for both loads and stores, because Intel CPUs are more efficient that way. (The port 7 store AGU handles non-indexed addressing only, and some micro-fused loads un-laminate with an indexed addressing mode. Micro fusion and addressing modes)
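For contrast, here is a minimal sketch of the 16-bytes-at-a-time SSE2 copy mentioned above, written with intrinsics and incrementing both pointers separately (assumes n is a multiple of 16; the helper name is mine):

```c
#include <assert.h>
#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>

// Copy n bytes, 16 at a time, with SSE2 unaligned loads/stores.
// Requires n to be a multiple of 16; a real memcpy would handle the tail.
static void copy_sse2(unsigned char *dst, const unsigned char *src, size_t n) {
    const unsigned char *end = src + n;
    while (src < end) {
        __m128i v = _mm_loadu_si128((const __m128i *)src);
        _mm_storeu_si128((__m128i *)dst, v);
        src += 16;   // both pointers bumped separately, so both
        dst += 16;   // memory operands use simple [reg] addressing
    }
}
```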
Answer 2:
Is it really enough?
Yes; the code you've shown is enough to copy the array.
For performance/optimization the code you've shown could be better; but optimization is a slippery slope that takes a detour through "rep movsb is better for code size", passes through "SIMD with loop unrolling", and ends at "you can avoid the need to copy the array".
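To make the first stop on that slope concrete, a sketch of rep movsb wrapped in GNU C inline asm (x86-64 with GCC/Clang only; the wrapper name is mine, not a standard API):

```c
#include <assert.h>
#include <stddef.h>

// rep movsb copies RCX bytes from [RSI] to [RDI]. The "+D"/"+S"/"+c"
// constraints pin dst/src/n to RDI/RSI/RCX and mark them as modified;
// "memory" tells the compiler the copy touches memory it can't see.
static void copy_rep_movsb(void *dst, const void *src, size_t n) {
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
}
```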
Source: https://stackoverflow.com/questions/56409664/copying-to-arrays-in-nasm