I\'m working on arithmetic for multiplication of very long integers (some 100,000 decimal digits). As part of my library I to add two long numbers.
Profiling shows that
The most difficult dependency is the propagation of carry between every memory block; I'd try to first device a method of dealing with that.
The following fragment simulates carry propagation, but with the "benefit" of not using the carry flag. This can be parallelised for three or four separate threads, each producing a half carry about 25000 decimal digits (or 10000 bytes) apart. Then the probability of those carries affecting only one byte, word, dword etc. will asymptotically reach zero.
long long carry=0;
for (int i=0;i>=32;
}
According to my profiling, carryless addition using xmm would take ~550ms (1e9 words), the simulated carry would take ~1020ms and 4-way parallelized version would take ~820ms (without any assembler optimization).
Architectural optimizations could include using redundant number system, where the carry doesn't have to be propagated all the time and where the evaluation of carry could be postponed almost ad infinitum.