I am working with AVX2 and need to compute a 64-bit x 64-bit -> 128-bit widening multiplication and get the high 64-bit part as fast as possible. Since AVX2 has no such instruction
It's highly unlikely that AVX2 will beat the mulx instruction, which does 64bx64b to 128b in one instruction. There is one exception I'm aware of: large multiplications using floating-point FFT.
However, if you don't need exactly 64bx64b to 128b you could consider 53bx53b to 106b using double-double arithmetic.
Multiplying four 53-bit numbers a and b to get four 106-bit numbers takes only two instructions:
__m256d p = _mm256_mul_pd(a,b);      // high parts: a*b rounded to double
__m256d e = _mm256_fmsub_pd(a,b,p);  // low parts: a*b - p, the rounding error (needs FMA)
This gives four 106-bit numbers in two instructions compared to one 128-bit number in one instruction using mulx.
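As a minimal sketch of how the two instructions might be wrapped up, the helper below multiplies four pairs of doubles and stores each product as an unevaluated sum hi + lo. The name mul_dd4 and the array-based interface are assumptions made for illustration and are not part of the original answer. The sketch also assumes GCC or Clang (for the target attribute), a CPU with FMA support, and inputs that are integers of at most 53 bits.

```c
#include <immintrin.h>

/* Hypothetical helper (name and interface are illustrative only).
   For each lane i, computes hi[i] + lo[i] == a[i] * b[i] exactly,
   assuming no overflow or underflow occurs. */
__attribute__((target("avx2,fma")))
static void mul_dd4(const double a[4], const double b[4],
                    double hi[4], double lo[4])
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);

    /* High part: the product rounded to the nearest double. */
    __m256d p = _mm256_mul_pd(va, vb);

    /* Low part: a*b - p evaluated with a single rounding by the FMA,
       which recovers the rounding error of the multiplication. */
    __m256d e = _mm256_fmsub_pd(va, vb, p);

    _mm256_storeu_pd(hi, p);
    _mm256_storeu_pd(lo, e);
}
```

For example, with a = b = 2^52 + 1 the exact product is 2^104 + 2^53 + 1, so hi holds 2^104 + 2^53 and lo holds 1. Note that lo can be negative when the product rounds up. Without the target attribute, the same code should compile with -mavx2 -mfma.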