Speed up x64 assembler ADD loop

问题

I'm working on arithmetic for multiplication of very long integers (some 100,000 decimal digits). As part of my library I to add two long numbers.

Profiling shows that my code runs up to 25% of it's time in the add() and sub() routines, so it's important they are as fast as possible. But I don't see much potential, yet. Maybe you can give me some help, advice, insight or ideas. I'll test them and get back to you.

So far my add routine does some setup and then uses a 8-times unrolled loop:

mov rax, QWORD PTR [rdx+r11*8-64]
mov r10, QWORD PTR [r8+r11*8-64]
adc rax, r10
mov QWORD PTR [rcx+r11*8-64], rax

7 more blocks with different offsets follow and then it loops.

I tried loading the values from memory earlier, but that didn't help. I guess that is because of good prefetching. I use an Intel i7-3770 Ivy Bridge 4-core CPU. But I'd like to write code that works good on any modern CPU.

Edit: I did some timings: It adds 1k words in about 2.25 cycles/word. If I remove the ADC, so only the MOVs remain, it still takes about 1.95 cycles/word. So the main bottleneck seems to be the memory access. A library memcpy() works in about 0.65 cycles/word, but has only one input, not two. Still, it's much faster because of its use of SSE registers, I guess.

Some questions:

Is it useful to use "load, load, add, store" structure or would a "load, add-to-memory" help? So far my tests didn't show any advantages.
As usual, there is no help from SSE(2,3,4) to be expected?
Does the addressing (scaled index plus base plus offset) impact badly? I could use ADD r11, 8 instead.
What about the loop unrolling? I read unrolling was bad for Sandy Bridge architecture (Agner Fog http://www.agner.org/optimize/). Is it to be preferred or avoided?
(Edit) Can I use SSE registers to load and store words in larger chunks from memory and efficiently exchange words with general purpose registers and SSE registers?

I highly appreciate any comments.

回答1:

I'm pretty sure memcpy is faster because it doesn't have a dependency on the data being fetched before it can perform the next operation.

If you can arrange your code so that it does something like this:

mov rax, QWORD PTR [rdx+r11*8-64]
mov rbx, QWORD PTR [rdx+r11*8-56]
mov r10, QWORD PTR [r8+r11*8-64]
mov r12, QWORD PTR [r8+r11*8-56]
adc rax, r10
adc rbx, r12
mov QWORD PTR [rcx+r11*8-64], rax
mov QWORD PTR [rcx+r11*8-56], rbx

I'm not 100% sure that the offset of -56 is the right for your code, but the concept is "right".

I would also consider cache-hits/cache-collisions. E.g. if you have three blocks of data [which it would seem that you do], you make sure they are NOT aligned to the same offset in the cache. A bad example would be if you allocate all your blocks at a multiple of the cache-size, from the same place in the cache. Over-allocating and making SURE that your different data blocks are offset by at least 512 byte [so allocate 4K oversize, and round up to 4K boundary start address, then add 512 to the second buffer, and 1024 to the third buffer]

If your data is large enough (bigger than L2 cache), you may want to use MOVNT to fetch/store your data. That will avoid reading into the cache - this is ONLY of benefit when you have very large data, where the next read will simply cause something else that you may find "useful" to be kicked out of the cache, and you won't get back to it before you've kicked it out of the cache anyways - so storing the value in the cache won't actually help...

Edit: Using SSE or similar won't help, as covered here: Can long integer routines benefit from SSE?

回答2:

The most difficult dependency is the propagation of carry between every memory block; I'd try to first device a method of dealing with that.

The following fragment simulates carry propagation, but with the "benefit" of not using the carry flag. This can be parallelised for three or four separate threads, each producing a half carry about 25000 decimal digits (or 10000 bytes) apart. Then the probability of those carries affecting only one byte, word, dword etc. will asymptotically reach zero.

 long long carry=0;
 for (int i=0;i<N;i++) {
     carry += (long long)*a++ + (long long)*b++;
     *c++ = carry; carry>>=32;
 }

According to my profiling, carryless addition using xmm would take ~550ms (1e9 words), the simulated carry would take ~1020ms and 4-way parallelized version would take ~820ms (without any assembler optimization).

Architectural optimizations could include using redundant number system, where the carry doesn't have to be propagated all the time and where the evaluation of carry could be postponed almost ad infinitum.

回答3:

Try to prefetch data first (you could try to read more data blocks to x64 registers first then do the calculations), check if the data is aligned properly in the memory, put loop code at label aligned to 16, try to remove SIB addressing

You could also try to shorten your code to:

mov rax, QWORD PTR [rdx+r11*8-64]
adc rax, QWORD PTR [r8+r11*8-64]
mov QWORD PTR [rcx+r11*8-64], rax

来源：https://stackoverflow.com/questions/13971088/speed-up-x64-assembler-add-loop

标签

optimization

assembly

x86-64

bignum

arbitrary-precision