I am currently writing a vectorized version of the QR decomposition (linear system solver) using SSE and AVX intrinsics. One of the substeps requires selecting the sign of a
Here's a version that I think is slightly better than the accepted answer if you target icc:
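#include <immintrin.h>

// Returns a vector with the magnitude of `to` and the sign of `from`, lane by lane.
__m256d copysign_pd(__m256d from, __m256d to) {
    __m256d const avx_sigbit = _mm256_set1_pd(-0.);  // only the sign bit is set in each lane
    return _mm256_or_pd(_mm256_and_pd(avx_sigbit, from), _mm256_andnot_pd(avx_sigbit, to));
}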
It uses _mm256_set1_pd rather than a broadcast intrinsic. On clang and gcc this is mostly a wash, but on icc the broadcast version actually writes a constant to the stack and then broadcasts from it, which is ... terrible.
Godbolt showing AVX-512 code; adjust the -march= option to -march=skylake to see AVX2 code.
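For readers less used to the mask trick, the per-lane logic of copysign_pd can be sketched in scalar C. This is only an illustration, not part of the code above: copysign_bits is a hypothetical helper name, and the sketch assumes IEEE-754 binary64 doubles, where the sign lives in bit 63.

```c
#include <stdint.h>
#include <string.h>

/* Scalar sketch of what copysign_pd does in each 64-bit lane.
   copysign_bits is a hypothetical name used only for illustration. */
double copysign_bits(double from, double to) {
    const uint64_t sigbit = UINT64_C(1) << 63;  /* bit pattern of -0.0 (IEEE-754 binary64 assumed) */
    uint64_t f, t, r;
    memcpy(&f, &from, sizeof f);                /* reinterpret the doubles as raw bits */
    memcpy(&t, &to, sizeof t);
    r = (sigbit & f) | (~sigbit & t);           /* AND keeps from's sign, ANDNOT keeps to's magnitude */
    memcpy(&to, &r, sizeof to);                 /* reinterpret the result bits as a double */
    return to;
}
```

Note that the argument order is the reverse of the standard library function: copysign_bits(from, to) corresponds to copysign(to, from) from <math.h>.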
Here's an untested AVX-512 version which uses vpternlogq directly (via _mm512_ternarylogic_epi64), which compiles down to a single vpternlogd instruction on icc and clang (gcc includes a separate broadcast):
__m512d copysign_pd_alt(__m512d from, __m512d to) {
    const __m512i sigbit = _mm512_castpd_si512(_mm512_set1_pd(-0.));
    // imm8 0xE4 selects per bit: where sigbit is set take from, otherwise take to
    return _mm512_castsi512_pd(_mm512_ternarylogic_epi64(
        _mm512_castpd_si512(from), _mm512_castpd_si512(to), sigbit, 0xE4));
}
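The 0xE4 immediate is easier to trust once it is derived. The ternary-logic instruction looks up each result bit in the 8-bit immediate, using the three input bits as an index with the first operand as the most significant bit. The sketch below emulates that lookup in scalar C and builds the immediate for "take the bit from a where c is set, otherwise from b", which is what copying the sign bit needs. ternlog64 and select_imm8 are hypothetical names made up for this illustration, not intrinsics.

```c
#include <stdint.h>

/* Bitwise emulation of a 64-bit ternary-logic lookup: at every bit position,
   the bits of a, b and c form the index (a<<2 | b<<1 | c) into imm8. */
uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8) {
    uint64_t r = 0;
    for (int i = 0; i < 64; ++i) {
        unsigned idx = (unsigned)((((a >> i) & 1) << 2) |
                                  (((b >> i) & 1) << 1) |
                                   ((c >> i) & 1));
        r |= (uint64_t)((imm8 >> idx) & 1) << i;
    }
    return r;
}

/* Builds the immediate for the per-bit select "c ? a : b" by walking the
   eight rows of the truth table. */
uint8_t select_imm8(void) {
    uint8_t imm = 0;
    for (unsigned idx = 0; idx < 8; ++idx) {
        unsigned a = (idx >> 2) & 1, b = (idx >> 1) & 1, c = idx & 1;
        if (c ? a : b)
            imm |= (uint8_t)(1u << idx);
    }
    return imm;
}
```

The rows that produce a 1 are indices 2, 5, 6 and 7, so select_imm8() returns 0xE4, the constant used in copysign_pd_alt with from as a, to as b and the sign-bit mask as c.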
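You could make a 256-bit version of copysign_pd_alt for when AVX-512 is enabled but you're dealing with __m256* vectors; the corresponding intrinsic, _mm256_ternarylogic_epi64, additionally requires AVX-512VL.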