Speed up x64 assembler ADD loop

悲&欢浪女 2021-02-20 05:27

I'm working on arithmetic for multiplication of very long integers (some 100,000 decimal digits). As part of my library I need to add two long numbers.

Profiling shows that the addition is where most of the time goes, and that a plain memcpy over the same amount of data runs noticeably faster than my ADC loop.

3 Answers
  •  甜味超标
    2021-02-20 06:17

    I'm pretty sure memcpy is faster because it has no dependency on the data fetched by one operation before it can issue the next one, whereas an ADC loop is serialized through the carry flag.

    If you can arrange your code so that it does something like this:

    mov rax, QWORD PTR [rdx+r11*8-64]   ; load two limbs of the first source...
    mov rbx, QWORD PTR [rdx+r11*8-56]
    mov r10, QWORD PTR [r8+r11*8-64]    ; ...and the matching two limbs of the second
    mov r12, QWORD PTR [r8+r11*8-56]
    adc rax, r10                        ; carry chains in from the previous pair
    adc rbx, r12
    mov QWORD PTR [rcx+r11*8-64], rax   ; store the two result limbs
    mov QWORD PTR [rcx+r11*8-56], rbx
    

    I'm not 100% sure that the offset of -56 is the right one for your code, but the concept is "right".
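
    If the surrounding library code is C, a rough equivalent of that unrolled carry chain can be written with the _addcarry_u64 intrinsic and left to the compiler to schedule. This is only a sketch under my own assumptions (little-endian arrays of 64-bit limbs, n a multiple of 2; the function name is made up):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: add two n-limb numbers, unrolled by two so the loads of
       the next pair have no dependency on the adc chain of the
       current pair. */
    static void add_limbs(uint64_t *dst, const uint64_t *a,
                          const uint64_t *b, size_t n)
    {
        unsigned char carry = 0;
        for (size_t i = 0; i < n; i += 2) {
            unsigned long long s0, s1;
            carry = _addcarry_u64(carry, a[i],     b[i],     &s0);
            carry = _addcarry_u64(carry, a[i + 1], b[i + 1], &s1);
            dst[i]     = s0;
            dst[i + 1] = s1;
        }
    }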

    I would also consider cache hits/cache collisions. E.g. if you have three blocks of data [which it would seem that you do], make sure they are NOT aligned to the same offset in the cache. A bad example would be allocating all your blocks at a multiple of the cache size, so they all map to the same place in the cache. Over-allocate and make SURE that your different data blocks are offset by at least 512 bytes: allocate 4K oversize, round the start address up to a 4K boundary, then add 512 to the second buffer and 1024 to the third buffer.
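
    A sketch of that over-allocation trick in C, using the numbers from the paragraph above (4K oversize, 512-byte stagger; the helper name and interface are made up):

    #include <stdint.h>
    #include <stdlib.h>

    /* Sketch: return a pointer rounded up to a 4 KB boundary and then
       staggered by index*512 bytes, so buffers 0, 1, 2 map to
       different cache sets. *raw_out keeps the original pointer
       for free(). */
    static void *alloc_staggered(size_t bytes, size_t index, void **raw_out)
    {
        void *raw = malloc(bytes + 4096 + index * 512);  /* over-allocate */
        if (!raw) return NULL;
        uintptr_t p = ((uintptr_t)raw + 4095) & ~(uintptr_t)4095;
        *raw_out = raw;
        return (void *)(p + index * 512);
    }

    Calling it with index 0, 1 and 2 for the three blocks gives start addresses offset by 0, 512 and 1024 bytes within the page.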

    If your data is large enough (bigger than L2 cache), you may want to use MOVNT to fetch/store your data. That will avoid reading into the cache - this is ONLY of benefit for very large data, where each read just evicts something else you might still find "useful" from the cache, and you won't come back to the value before it has been evicted anyway - so keeping it in the cache won't actually help...
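
    On the store side, the non-temporal instruction is MOVNTI, exposed in C as _mm_stream_si64 (the non-temporal load form, MOVNTDQA, only bypasses the cache on write-combining memory, so on ordinary allocations the practical win is on the store side). A sketch combining it with the carry chain above, again with my own illustrative names:

    #include <emmintrin.h>  /* _mm_stream_si64 (MOVNTI), _mm_sfence */
    #include <immintrin.h>  /* _addcarry_u64 */
    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: same add loop, but the result limbs bypass the cache on
       the way out to memory. */
    static void add_limbs_nt(uint64_t *dst, const uint64_t *a,
                             const uint64_t *b, size_t n)
    {
        unsigned char carry = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned long long s;
            carry = _addcarry_u64(carry, a[i], b[i], &s);
            _mm_stream_si64((long long *)&dst[i], (long long)s);
        }
        _mm_sfence();  /* make the streaming stores visible in order */
    }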

    Edit: Using SSE or similar won't help, as covered here: Can long integer routines benefit from SSE?
