Speed up x64 assembler ADD loop

backend · open · 3 answers · 2214 views
悲&欢浪女 2021-02-20 05:27

I'm working on arithmetic for multiplication of very long integers (some 100,000 decimal digits). As part of my library I need to add two long numbers.

Profiling shows that

3 Answers
  • 2021-02-20 05:55

    The most difficult dependency is the propagation of the carry across every memory block; I'd first try to devise a method of dealing with that.

    The following fragment simulates carry propagation, but with the "benefit" of not using the carry flag. It can be parallelised across three or four separate threads, each producing a partial carry, working about 25,000 decimal digits (or 10,000 bytes) apart. The probability of such a carry propagating beyond a single byte, word, dword, etc. then asymptotically approaches zero.

     // a, b: input operands as arrays of N 32-bit limbs (little-endian)
     // c:    output array of N 32-bit limbs
     long long carry = 0;
     for (int i = 0; i < N; i++) {
         carry += (long long)*a++ + (long long)*b++; // limb sum plus incoming carry
         *c++ = carry;                               // low 32 bits become the result limb
         carry >>= 32;                               // high bits carry into the next limb
     }
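
    The multi-threaded variant described above could be sketched as follows (shown single-threaded for clarity; the limb width, block count, and function names are illustrative assumptions, not the answerer's actual code). Each block is summed independently with its own carry-out, then a cheap serial fix-up pass ripples the inter-block carries, which almost always die out within a limb or two:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Add n 32-bit limbs in nblocks independent blocks, then fix up the
       inter-block carries. In a threaded version each block would run on
       its own thread; only the short fix-up pass is serial. */
    static void block_add(const uint32_t *a, const uint32_t *b, uint32_t *c,
                          int n, int nblocks, uint64_t *carry_out)
    {
        int len = n / nblocks;                  /* assume nblocks divides n */
        for (int blk = 0; blk < nblocks; blk++) {
            uint64_t carry = 0;
            for (int i = blk * len; i < (blk + 1) * len; i++) {
                carry += (uint64_t)a[i] + b[i];
                c[i] = (uint32_t)carry;
                carry >>= 32;
            }
            carry_out[blk] = carry;             /* carry leaving this block */
        }
        /* Serial fix-up: ripple each block's carry into the next block;
           it usually stops after the first non-0xFFFFFFFF limb. */
        for (int blk = 0; blk + 1 < nblocks; blk++) {
            uint64_t carry = carry_out[blk];
            for (int i = (blk + 1) * len; carry && i < (blk + 2) * len; i++) {
                carry += c[i];
                c[i] = (uint32_t)carry;
                carry >>= 32;
            }
            /* a carry surviving a whole block spills into the one after it */
            if (carry) carry_out[blk + 1] += carry;
        }
    }
    ```

    The last entry of `carry_out` is then the overall carry-out of the addition.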
    

    According to my profiling, carryless addition using xmm would take ~550 ms (1e9 words), the simulated carry ~1020 ms, and the 4-way parallelized version ~820 ms (without any assembler optimization).

    Architectural optimizations could include using a redundant number system, in which the carry does not have to be propagated on every addition and the evaluation of carries can be postponed almost indefinitely.
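
    A minimal sketch of that idea (the 32-bit limb width and the normalization policy are assumptions): keep each 32-bit digit in a 64-bit slot, so repeated additions are plain word-wise adds with no carry chain at all, and only run a normalization pass when the slack bits could overflow:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Redundant representation: each 64-bit slot holds one 32-bit digit plus
       up to 32 bits of deferred carry, so about 2^32 additions fit before a
       normalization pass is required. */
    static void redundant_add(uint64_t *acc, const uint32_t *x, int n)
    {
        for (int i = 0; i < n; i++)
            acc[i] += x[i];               /* no carry propagation at all */
    }

    static void normalize(uint64_t *acc, int n)
    {
        uint64_t carry = 0;
        for (int i = 0; i < n; i++) {
            carry += acc[i];
            acc[i] = (uint32_t)carry;     /* back to one 32-bit digit per slot */
            carry >>= 32;
        }
    }
    ```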

  • 2021-02-20 06:13

    Try to prefetch the data first (you could read several data blocks into x64 registers before doing the calculations), check that the data is aligned properly in memory, align the loop label to 16 bytes, and try to remove SIB addressing.

    You could also try to shorten your code to:

    mov rax, QWORD PTR [rdx+r11*8-64]
    adc rax, QWORD PTR [r8+r11*8-64]
    mov QWORD PTR [rcx+r11*8-64], rax
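
    In C, GCC and Clang can often be coaxed into emitting exactly this mov/adc/mov pattern; a hedged sketch using `__builtin_add_overflow` (the loop structure and names are illustrative, not the asker's code):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Add n 64-bit limbs with carry. GCC/Clang typically compile the two
       __builtin_add_overflow calls per limb down to a single adc chain. */
    static uint64_t add_n(const uint64_t *a, const uint64_t *b, uint64_t *c, int n)
    {
        unsigned carry = 0;
        for (int i = 0; i < n; i++) {
            uint64_t s;
            unsigned c1 = __builtin_add_overflow(a[i], b[i], &s);
            unsigned c2 = __builtin_add_overflow(s, (uint64_t)carry, &c[i]);
            carry = c1 | c2;          /* at most one of c1, c2 can be set */
        }
        return carry;                 /* final carry-out */
    }
    ```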
    
  • 2021-02-20 06:17

    I'm pretty sure memcpy is faster because it doesn't have a dependency on the data being fetched before it can perform the next operation.

    If you can arrange your code so that it does something like this:

    mov rax, QWORD PTR [rdx+r11*8-64]
    mov rbx, QWORD PTR [rdx+r11*8-56]
    mov r10, QWORD PTR [r8+r11*8-64]
    mov r12, QWORD PTR [r8+r11*8-56]
    adc rax, r10
    adc rbx, r12
    mov QWORD PTR [rcx+r11*8-64], rax
    mov QWORD PTR [rcx+r11*8-56], rbx
    

    I'm not 100% sure that the offset of -56 is right for your code, but the concept is "right".

    I would also consider cache hits/cache collisions. E.g. if you have three blocks of data (which it would seem you do), make sure they are NOT aligned to the same offset in the cache. A bad example would be allocating all your blocks at a multiple of the cache size, so they all map to the same place in the cache. Over-allocate and make SURE your different data blocks are offset by at least 512 bytes: allocate 4K oversize, round up to a 4K-boundary start address, then add 512 to the second buffer and 1024 to the third buffer.
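
    That over-allocation trick could look like this (the 4 KiB rounding and 512-byte stagger follow the suggestion above; the function name is illustrative):

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Allocate 'size' bytes plus 4 KiB of slack, round the start up to a
       4 KiB boundary, then stagger it by 'offset' so the three big operand
       buffers do not all map to the same cache sets. The raw pointer is
       returned via 'raw' so the caller can free() it later. */
    static void *alloc_staggered(size_t size, size_t offset, void **raw)
    {
        *raw = malloc(size + 4096 + offset);
        if (!*raw)
            return NULL;
        uintptr_t p = ((uintptr_t)*raw + 4095) & ~(uintptr_t)4095; /* round up */
        return (void *)(p + offset);
    }
    ```

    The second and third buffers would then be allocated with offsets 512 and 1024 respectively.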

    If your data is large enough (bigger than the L2 cache), you may want to use MOVNT to fetch/store your data. That avoids reading into the cache. This is ONLY of benefit when you have very large data, where the next read would simply evict something else you may find "useful" from the cache, and you won't get back to the stored value before it has been evicted anyway, so keeping it in the cache won't actually help.
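
    A hedged sketch of a non-temporal store loop using SSE2 intrinsics (assumes an x86-64 target and a GCC/Clang-style compiler; `_mm_stream_si128` is the intrinsic behind MOVNTDQ, and the function name is illustrative):

    ```c
    #include <assert.h>
    #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Copy n 16-byte chunks with non-temporal stores so the destination does
       not displace useful lines from the cache. 'dst' must be 16-byte aligned. */
    static void stream_copy(void *dst, const void *src, size_t n)
    {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < n; i++) {
            __m128i v = _mm_loadu_si128(s + i);
            _mm_stream_si128(d + i, v);  /* MOVNTDQ: bypass the cache */
        }
        _mm_sfence();                    /* make the streamed stores visible */
    }
    ```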

    Edit: Using SSE or similar won't help, as covered here: Can long integer routines benefit from SSE?
