practical BigNum AVX/SSE possible?

房东的猫 提交于 2019-11-27 02:14:41

I think it may be possible to implement BigNum with SIMD efficiently but not in the way you suggest.

Instead of implementing a single BigNum using a SIMD register (or with an array of SIMD registers) you should process multiple BigNums at once.

Let's consider 128-bit addition. Let 128-bit integers be defined by a pair of high and low 64-bit values and let's assume we want to add a 128-bit integer (y_low, y_high) to a 128-bit integer (x_low, x_high). With the scalar 64-bit registers this requires only two instructions

add rax, rdi // x_low  += y_low;
adc rdx, rsi // x_high += y_high + (x_low < y_low);

With SSE/AVX the problem, as others have explain, is that there is no SIMD carry flags. The carry flag has to be calculated and then added. This requires a 64-bit unsigned comparison. The only realistic option for this with SSE is from the AMD XOP instruction vpcomgtuq

vpaddq      xmm2, xmm0, xmm2 // x_low  += y_low;
vpcomgtuq   xmm0, xmm0, xmm2 // x_low  <  y_low
vpaddq      xmm1, xmm1, xmm3 // x_high += y_high
vpsubq      xmm0, xmm1, xmm0 // x_high += xmm0

This uses four instructions to add two pairs of 128-bit numbers. With the scalar 64-bit registers this requires four instructions as well (two add and two adc).

With AVX2 we can add four pairs of 128-bit numbers at once. But there is no 256-bit wide 64-bit unsigned instruction from XOP. Instead we can do the following for a<b:

__m256i sign64 = _mm256_set1_epi64x(0x8000000000000000L);
__m256i aflip = _mm256_xor_si256(a, sign64);
__m256i bflip = _mm256_xor_si256(b, sign64);
__m256i cmp = _mm256_cmpgt_epi64(aflip,bflip);

The sign64 register can be precomputed so only three instructions are really necessary. Therefore, adding four pairs of 128-bit numbers with AVX2 can be done with six instructions

vpaddq
vpaddq
vpxor
vpxor
vpcmpgtq 
vpsubq

whereas the scalar registers need eight instructions.

AVX512 has a single instruction for doing 64-bit unsigned comparison vpcmpuq. Therefore, it should be possible to add eight pairs of 128-bit numbers using only four instructions

vpaddq
vpaddq
vpcmpuq
vpsubq

With the scalar register it would require 16 instructions to add eight pairs of 128-bit numbers.

Here is a table with a summary of the number of SIMD instructions (called nSIMD) and the number of scalar instructions (called nscalar) necessary to add a number of pairs (called npairs) of 128-bit numbers

              nSIMD      nscalar     npairs
SSE2 + XOP        4           4           2
AVX2              6           8           4
AVX2 + XOP2       4           8           4
AVX-512           4          16           8

Note that XOP2 does not exist yet and I am only speculating that it may exist at some point.

Note also that to do this efficiently the BigNum arrays needs to be stored in an array of struct of array (AoSoA) form. For example using l to mean the lower 64-bits and h to mean the high 64-bits an array of 128-bit integers stores as an array of structs like this

lhlhlhlhlhlhlhlh

should instead be stored using an AoSoA like this

SSE2:   llhhllhhllhhllhh
AVX2:   llllhhhhllllhhhh
AVX512: llllllllhhhhhhhh

Moved from comment above

It is possible to do this but it is not done because it is not particularly convenient to implement bignums in vector registers.

For the simple task of addition, it is trivial to use the x86 EFLAGS/RFLAGS' register's Carry flag to propagate the addition's carries from the lowest "limb" up (to use the GMP terminology), and loop over an arbitrary amount of limbs laid in an array.

Contrariwise, the lanes of SSE/AVX registers do not have carry flags, which means overflow must be detected in a different way involving comparisons to detect wraparound, which is more computationally intense. Moreover, if an overflow is detected in one limb, it would have to be propagated by an ugly shuffle "upwards", and then added, and this addition may cause another overflow and carry-over, up to N-1 times for an N-limb bignum. Then, once a sum brings a bignum beyond 128-bit/256-bits (or beyond 128 bits x # of registers), you'd have to move it to an array anyways.

Therefore, much special-case code would be needed, and it would not be any faster (in fact, much slower), just for addition. Imagine what it would take for multiplication? or gasp, division?

It's possible, but not practical.

As I said in the other answer, there's no carry flag in AVX/SSE so it's impossible to do addition and subtraction efficiently. And to do multiplications you'll need a lot of shuffling to get the widening multiply result in the desired position.

If you are allowed to work with the newer Haswell/Broadwell microarchitecture, the solution would be MULX in BMI2 and ADOX, ADCX in ADX. You can read about them here.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!