Speed up x64 assembler ADD loop

悲&欢浪女 2021-02-20 05:27

I'm working on arithmetic for multiplication of very long integers (some 100,000 decimal digits). As part of my library I need to add two long numbers.

Profiling shows that

3 answers

攒了一身酷 (original poster)

2021-02-20 05:55
The most difficult dependency is the propagation of the carry between memory blocks; I'd first try to devise a method of dealing with that.

The following fragment simulates carry propagation, but with the "benefit" of not using the carry flag. This can be parallelised across three or four separate threads, each producing a half carry about 25000 decimal digits (or 10000 bytes) apart. The probability that such a carry propagates beyond a single byte, word, dword, etc. then asymptotically approaches zero.
```
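 long long carry=0;
 // Loop body lost in extraction and reconstructed; a, b, c (uint32_t arrays) and n are assumed names.
 for (int i=0;i<n;i++) {
     carry += (long long)a[i] + b[i];     // 64-bit sum of two 32-bit words plus the carry-in
     c[i] = (uint32_t)carry; carry>>=32;  // store the low word, keep the carry for the next word
 }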
```
According to my profiling, carryless addition using xmm would take ~550ms (1e9 words), the simulated carry would take ~1020ms and the 4-way parallelized version would take ~820ms (without any assembler optimization).
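A minimal sketch of the chunked scheme described above, written in C and run sequentially here for clarity. The names `add_chunk` and `add_chunked`, the 32-bit word layout, and the `MAX_PARTS` limit are assumptions for illustration, not the code that was profiled. Each chunk is added with a carry-in of zero (that loop is the part that could be handed to separate threads), and the carries at chunk boundaries are fixed up afterwards; the fix-up ripple almost always stops after one word.

```c
#include <stdint.h>
#include <stddef.h>

/* Add one chunk with carry-in 0; return the carry out of the chunk (0 or 1). */
static uint32_t add_chunk(const uint32_t *a, const uint32_t *b,
                          uint32_t *c, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        carry += (uint64_t)a[i] + b[i];
        c[i] = (uint32_t)carry;   /* low 32 bits are the result word */
        carry >>= 32;             /* high bits are the carry */
    }
    return (uint32_t)carry;
}

/* Hypothetical driver: split n words into `parts` chunks, add them
   independently, then propagate the boundary carries in order.
   Returns the carry out of the most significant word. */
uint32_t add_chunked(const uint32_t *a, const uint32_t *b, uint32_t *c,
                     size_t n, size_t parts)
{
    enum { MAX_PARTS = 16 };      /* assumed upper bound for this sketch */
    uint32_t cout[MAX_PARTS];
    size_t start[MAX_PARTS + 1];

    if (parts == 0) parts = 1;
    if (parts > MAX_PARTS) parts = MAX_PARTS;
    for (size_t p = 0; p <= parts; p++)
        start[p] = n * p / parts;

    /* Independent per-chunk additions: this loop is the parallelisable part. */
    for (size_t p = 0; p < parts; p++)
        cout[p] = add_chunk(a + start[p], b + start[p], c + start[p],
                            start[p + 1] - start[p]);

    /* Sequential fix-up: feed each chunk's carry into the next chunk. */
    uint32_t carry = 0;
    for (size_t p = 0; p < parts; p++) {
        for (size_t i = start[p]; carry && i < start[p + 1]; i++) {
            c[i] += 1;
            carry = (c[i] == 0);  /* continues only if the word wrapped */
        }
        /* A chunk that produced a carry cannot also overflow from the
           incoming +1, so at most one of the two is set. */
        carry |= cout[p];
    }
    return carry;
}
```

The fix-up pass touches more than one word per boundary only when the first word of the next chunk is all ones, which is why its cost is negligible for random data.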

Architectural optimizations could include using a redundant number system, where the carry doesn't have to be propagated all the time and where the evaluation of the carry could be postponed almost ad infinitum.
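One way to picture such a redundant representation, as a sketch under assumptions rather than a tuned implementation: store 32-bit digits in 64-bit limbs, so each limb has 32 spare bits in which carries can pile up. The function names `add_lazy` and `normalize` and the limb layout are hypothetical. Additions become independent per-limb operations with no carry chain, and a single normalisation pass resolves the deferred carries when a canonical result is needed.

```c
#include <stdint.h>
#include <stddef.h>

/* Redundant form: value = sum of limb[i] * 2^(32*i), where a 64-bit limb
   may temporarily hold more than 32 bits' worth of digit. */

/* acc += x, limb by limb, with no carry propagation.
   x is assumed normalised (every limb < 2^32). With 32 spare bits per
   limb, fewer than 2^32 such addends can be accumulated before
   normalize() must be called. */
void add_lazy(uint64_t *acc, const uint64_t *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc[i] += x[i];           /* independent per limb: vectorisable */
}

/* Resolve the deferred carries so that every limb is < 2^32 again.
   Returns the carry out of the top limb. */
uint64_t normalize(uint64_t *acc, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = acc[i] + carry;
        acc[i] = t & 0xFFFFFFFFu; /* keep the 32-bit digit */
        carry = t >> 32;          /* push the excess upward */
    }
    return carry;
}
```

The trade-off is twice the memory traffic per digit, so this pays off mainly when many additions are accumulated between normalisations, as in the inner sums of a long multiplication.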