Question
I have to write assembly code which copies 100 bytes in memory in a loop. I wrote it like this:
section .data
a times 100 db 1 ;reserve 100 bytes and fill with 1
b times 100 db 0 ;reserve 100 bytes and fill with 0
section _start
global _start
_start:
mov rsi, a ;get array a address
mov rdi, b ;get array b address
_for: ;start of the loop
cmp cx, 100 ;loop
jae _end_for ;loop
push cx ;loop
mov byte al, [rsi] ;get one byte from array a into al
mov byte [rdi], al ;put one byte from al to array b
inc rsi ;set rsi to next byte in array a
inc rdi ;set rdi to next byte in array b
pop cx ;loop
inc cx ;loop
jmp _for ;loop
_end_for:
_end:
mov rax, 60
mov rdi, 0
syscall
I'm not sure about the copying part. I read the value from one address into a register and then store it at the other. That looks good to me, but I'm not sure about incrementing rsi and rdi. Is it really enough?
I'm new to NASM and assembly, so please help :-)
Answer 1:
I know about rep movsb, but the task was to do it in a loop, byte after byte. I don't know if it could be done a better way.
If you have to loop 1 byte at a time, here's how to do that efficiently. It's worth mentioning because looping efficiently is useful for cases other than memcpy as well!
First of all, you know that your loop body should run at least once, so you can use a normal loop structure with a conditional branch at the bottom. (Why are loops always compiled into "do...while" style (tail jump)?)
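In C terms (an illustrative sketch, not code from the original answer), a copy loop whose body always runs at least once is naturally a do...while with the test at the bottom, which is the shape the assembly below uses:

```c
#include <assert.h>
#include <stddef.h>

// Copy n bytes (requires n >= 1): exactly one conditional branch per
// iteration, at the bottom of the loop -- the "do...while" loop shape.
static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n) {
    size_t i = 0;
    do {
        dst[i] = src[i];
        i++;
    } while (i != n);
}
```

A compiler lowers this to a body followed by a single conditional jump back to the top, with no separate test before the first iteration.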
Second, if you're not going to unroll at all then you should use an indexed addressing mode to avoid having to increment both pointers. (But really it would be better to unroll).
And don't use 16-bit registers if you don't have to. Prefer 32-bit operand-size (ECX); writing a 32-bit register implicitly zero-extends to 64-bit so it's safe to use an index as part of an addressing mode.
You can use an indexed load but a non-indexed store so your store-address uops can still run on port7, making this slightly more hyperthreading-friendly on Haswell/Skylake. And avoiding un-lamination on Sandybridge. Obviously copying 1 byte at a time is total garbage for performance, but sometimes you do want to loop and actually do something with each byte while it's in a register, and you can't manually vectorize it with SSE2 (to do 16 bytes at a time).
You can do this by indexing the src relative to the dst.
Or the other trick is to count a negative index up towards zero, so you avoid an extra cmp. Let's do that first:
default rel ; use RIP-relative addressing modes by default
ARR_SIZE equ 100
section .data
a: times ARR_SIZE db 1
section .bss
b: resb ARR_SIZE ;reserve n bytes of space in the BSS
;section _start ; do *not* use custom section names unless you have a good reason
; they might get linked with unexpected read/write/exec permission
section .text
global _start
_start:
lea rsi, [a+ARR_SIZE] ; pointers to one-past-the-end of the arrays
lea rdi, [b+ARR_SIZE] ; RIP-relative LEA is better than mov r64, imm64
mov rcx, -ARR_SIZE
.copy_loop: ; do {
movzx eax, byte [rsi+rcx] ; load without a false dependency on the old value of RAX
mov [rdi+rcx], al
inc rcx
jnz .copy_loop ; }while(++idx != 0);
.end:
mov eax, 60
xor edi, edi
syscall ; sys_exit(0)
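The same negative-index trick, sketched in C (the helper name is mine, not from the original): index both arrays from their one-past-the-end pointers and count a negative index up toward zero, so the increment itself produces the flags the branch tests:

```c
#include <assert.h>
#include <stddef.h>

// Copy n bytes (requires n >= 1) by counting a negative index up to zero:
// the loop condition falls out of the increment, no separate compare needed.
static void copy_neg_index(unsigned char *dst, const unsigned char *src, size_t n) {
    unsigned char *dend = dst + n;            // one-past-the-end pointers,
    const unsigned char *send = src + n;      // like lea r, [arr+ARR_SIZE]
    ptrdiff_t i = -(ptrdiff_t)n;              // like mov rcx, -ARR_SIZE
    do {
        dend[i] = send[i];                    // [rsi+rcx] / [rdi+rcx]
    } while (++i != 0);                       // inc rcx / jnz
}
```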
In position-dependent code like a static (or other non-PIE) Linux executable, mov edi, b+ARR_SIZE is the most efficient way to put a static address into a register.
Don't use _ for all your label names. _start is named that way because C symbol names that begin with _ are reserved for use by the implementation. It's not something you should copy; in fact the opposite is true.
Use .foo for a local label name inside a function; e.g. .foo: is shorthand for _start.foo: if you use it after _start.
Indexing src relative to dst:
Normally your input and output aren't both in static storage, so you have to sub the addresses at runtime. Here, if we put them both in the same section like you were originally doing, mov rcx, a-b will actually assemble. But if not, NASM refuses.
In fact, instead of a 2-register addressing mode, I could just be using [rdi + (a-b)], or simply [rdi - ARR_SIZE], because I know they're contiguous.
_start:
lea rdi, [b] ; RIP-relative LEA is better than mov r64, imm64
mov rcx, a-b ; distance between arrays so [rdi+rcx] = [a]
;;; for a-b to assemble, I had to move b back to the .data section.
lea rdx, [rdi+ARR_SIZE] ; end_dst pointer
.copy_loop: ; do {
movzx eax, byte [rdi + rcx] ; src = dst+(src-dst)
mov [rdi], al
inc rdi
cmp rdi, rdx
jb .copy_loop ; }while(dst < end_dst);
An end-of-the-array pointer is exactly like you'd use in C++ with foo.end() to get a pointer / iterator to one-past-the-end.
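A C sketch of this dst-relative pattern (illustrative names; pointer subtraction between unrelated objects is formally out of spec in ISO C but models the flat-memory assembly exactly):

```c
#include <assert.h>
#include <stddef.h>

// Copy n bytes (requires n >= 1), walking dst up to a one-past-the-end
// pointer and loading src via the constant offset (src - dst),
// mirroring movzx eax, byte [rdi + rcx] in the assembly.
static void copy_to_end(unsigned char *dst, const unsigned char *src, size_t n) {
    unsigned char *end_dst = dst + n;   // like lea rdx, [rdi+ARR_SIZE]
    ptrdiff_t off = src - dst;          // like mov rcx, a-b
    do {
        *dst = dst[off];                // src = dst + (src - dst)
        dst++;
    } while (dst < end_dst);            // inc rdi / cmp rdi, rdx / jb
}
```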
This needs INC + CMP/JCC as loop overhead. On AMD CPUs, CMP/JCC can macro-fuse into 1 uop but INC/JCC can't, so the extra CMP vs. indexing from the end is basically free (except for code size).
On Intel this avoids an indexed store. The load is a pure load in this case, so it's a single uop anyway without needing to stay micro-fused with an ALU uop. Intel can macro-fuse inc/jcc, so this does cost an extra uop of loop overhead.
This way of looping is good if you're unrolling, if you don't need to avoid an indexed addressing mode for loads. But if you're using a memory source for an ALU instruction like vaddps ymm0, ymm1, [rdi], then yes, you should increment both pointers separately so you can use non-indexed addressing modes for both loads and stores, because Intel CPUs are more efficient that way. (The port 7 store AGU handles non-indexed addressing only, and some micro-fused loads un-laminate with an indexed addressing mode. Micro fusion and addressing modes)
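For contrast, here is a minimal sketch of the 16-bytes-at-a-time SSE2 copy mentioned above, written with intrinsics and incrementing both pointers separately (assumes n is a multiple of 16; the helper name is mine):

```c
#include <assert.h>
#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>

// Copy n bytes, 16 at a time, with SSE2 unaligned loads/stores.
// Requires n to be a multiple of 16; a real memcpy would handle the tail.
static void copy_sse2(unsigned char *dst, const unsigned char *src, size_t n) {
    const unsigned char *end = src + n;
    while (src < end) {
        __m128i v = _mm_loadu_si128((const __m128i *)src);
        _mm_storeu_si128((__m128i *)dst, v);
        src += 16;   // both pointers bumped separately, so both
        dst += 16;   // memory operands use simple [reg] addressing
    }
}
```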
Answer 2:
Is it really enough?
Yes; the code you've shown is enough to copy the array.
For performance/optimization the code you've shown could be better; but optimization is a slippery slope that takes a detour through "rep movsb is better for code size", passes through "SIMD with loop unrolling", and ends at "you can avoid the need to copy the array".
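To make the first stop on that slope concrete, a sketch of rep movsb wrapped in GNU C inline asm (x86-64 with GCC/Clang only; the wrapper name is mine, not a standard API):

```c
#include <assert.h>
#include <stddef.h>

// rep movsb copies RCX bytes from [RSI] to [RDI]. The "+D"/"+S"/"+c"
// constraints pin dst/src/n to RDI/RSI/RCX and mark them as modified;
// "memory" tells the compiler the copy touches memory it can't see.
static void copy_rep_movsb(void *dst, const void *src, size_t n) {
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
}
```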
Source: https://stackoverflow.com/questions/56409664/copying-to-arrays-in-nasm