I'm working on arithmetic for multiplication of very long integers (some 100,000 decimal digits). As part of my library I need to add two long numbers.
Profiling shows that
Try to prefetch the data first (for example, load several data blocks into x64 registers before doing the calculations), check that the data is properly aligned in memory, align the label at the top of the loop to 16 bytes, and try to remove the SIB addressing.
You could also try to shorten your code to:
mov rax, QWORD PTR [rdx+r11*8-64]
adc rax, QWORD PTR [r8+r11*8-64]
mov QWORD PTR [rcx+r11*8-64], rax
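
For reference, here is a rough portable C sketch of what that load / adc / store sequence computes when repeated over every 64-bit limb. It is only an illustration, not the asker's actual routine: the function name add_limbs and the parameters dst, a, b and n are hypothetical, and the sketch assumes the limbs are stored least significant first, with dst, a and b playing the roles of rcx, rdx and r8 in the snippet above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst = a + b over n 64-bit limbs, least significant
 * limb first (an assumption). Returns the carry out of the top limb. */
uint64_t add_limbs(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t carry = 0;                 /* plays the role of the CPU carry flag */
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;      /* add the incoming carry (0 or 1) */
        uint64_t c1 = (s < carry);      /* 1 if that addition wrapped around */
        s += b[i];                      /* add the second operand */
        uint64_t c2 = (s < b[i]);       /* 1 if this addition wrapped around */
        dst[i] = s;                     /* corresponds to the final mov store */
        carry = c1 | c2;                /* at most one of c1, c2 can be set */
    }
    return carry;
}
```

A compiler will not necessarily turn this into an adc chain, so the sketch is mainly useful as a reference implementation for checking the output of the hand-written loop.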