I would like to make some vector computation faster, and I believe that SIMD instructions for float comparison and manipulation could help. Here is the operation:
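In scalar form, the operation is essentially this loop (left, right and res point to size doubles; th is the threshold and drop the decrement, as used throughout the answers below):

// Element-wise: keep left[i] where right[i] >= th, otherwise subtract drop from it.
for (size_t i = 0; i < size; ++i)
    res[i] = right[i] >= th ? left[i] : left[i] - drop;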
Here you go (untested); I’ve tried to explain in the comments what the instructions do.
#include <assert.h>
#include <stddef.h>
#include <smmintrin.h> // SSE 4.1

void func_sse41( const double* left, const double* right, double* res,
    const size_t size, double th, double drop )
{
    // Verify the size is even.
    // If it's not, you'll need extra code at the end to process the last value the old way.
    assert( 0 == ( size % 2 ) );

    // Load the scalar values into 2 registers.
    const __m128d threshold = _mm_set1_pd( th );
    const __m128d dropVec = _mm_set1_pd( drop );

    for( size_t i = 0; i < size; i += 2 )
    {
        // Load 4 double values into registers, 2 from right, 2 from left
        const __m128d r = _mm_loadu_pd( right + i );
        const __m128d l = _mm_loadu_pd( left + i );
        // Compare ( r >= threshold ) for 2 values at once
        const __m128d comp = _mm_cmpge_pd( r, threshold );
        // Compute ( left[ i ] - drop ), for 2 values at once
        const __m128d dropped = _mm_sub_pd( l, dropVec );
        // Select left where ( r >= threshold ), ( left - drop ) otherwise.
        // This is the only instruction here that requires SSE 4.1.
        const __m128d result = _mm_blendv_pd( dropped, l, comp );
        // Store the 2 result values
        _mm_storeu_pd( res + i, result );
    }
}
The code will crash with an “invalid instruction” runtime error if the CPU doesn’t have SSE 4.1. For best results, detect support with CPUID so you can fail gracefully. I think now in 2019 it’s quite reasonable to assume it’s supported: Intel has had it since 2008, AMD since 2011, and the Steam hardware survey says 96.3%. If you want to support older CPUs, it’s possible to emulate _mm_blendv_pd with 3 other instructions: _mm_and_pd, _mm_andnot_pd, _mm_or_pd.
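For example, something like this should work as a drop-in replacement on SSE2-only targets (untested; the helper name blendv_sse2 is mine):

#include <emmintrin.h> // SSE2

// Emulates _mm_blendv_pd( a, b, mask ) for masks whose lanes are all-ones or
// all-zeros, as produced by _mm_cmpge_pd above.
// (The real blendvpd only looks at the sign bit; this version needs full masks.)
static inline __m128d blendv_sse2( __m128d a, __m128d b, __m128d mask )
{
    return _mm_or_pd( _mm_andnot_pd( mask, a ), _mm_and_pd( mask, b ) );
}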
If you can guarantee the data is aligned, replacing the loads with _mm_load_pd will be slightly faster; _mm_cmpge_pd compiles into CMPPD (https://www.felixcloutier.com/x86/cmppd), which can take one of its arguments directly from RAM.
Potentially, you can get a further 2x improvement by writing an AVX version. But I hope even the SSE version is faster than your code: it handles 2 values per iteration and doesn’t have conditions inside the loop. If you’re unlucky, AVX will be slower; many CPUs need some time to power on their AVX units, which takes many thousands of cycles, and until they’re powered up, AVX code runs very slowly.
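The AVX version could look roughly like this (untested, same caveats as above; the function name func_avx is mine, and it assumes size is a multiple of 4):

#include <assert.h>
#include <stddef.h>
#include <immintrin.h> // AVX

void func_avx( const double* left, const double* right, double* res,
    const size_t size, double th, double drop )
{
    assert( 0 == ( size % 4 ) );
    const __m256d threshold = _mm256_set1_pd( th );
    const __m256d dropVec = _mm256_set1_pd( drop );
    for( size_t i = 0; i < size; i += 4 )
    {
        const __m256d r = _mm256_loadu_pd( right + i );
        const __m256d l = _mm256_loadu_pd( left + i );
        // ( r >= threshold ) for 4 values at once; false for NaN, like the scalar >=
        const __m256d comp = _mm256_cmp_pd( r, threshold, _CMP_GE_OQ );
        const __m256d dropped = _mm256_sub_pd( l, dropVec );
        // Keep left where the comparison was true, ( left - drop ) elsewhere
        const __m256d result = _mm256_blendv_pd( dropped, l, comp );
        _mm256_storeu_pd( res + i, result );
    }
}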
Clang already auto-vectorizes this pretty much the way Soonts suggested doing manually. Use __restrict on your pointers so it doesn't need a fallback version that works for overlap between some of the arrays. It still auto-vectorizes, but it bloats the function.
Unfortunately gcc only auto-vectorizes with -ffast-math. It turns out only -fno-trapping-math is required: that's generally safe, especially if you aren't using fenv access to unmask any FP exceptions (feenableexcept) or looking at MXCSR sticky FP exception flags (fetestexcept).
With that option, GCC too will use (v)pblendvpd with -march=nehalem or -march=znver1. See it on Godbolt.
Also, your C function is broken: th and drop are scalar double, but you declare them as const double *.
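Presumably you want them passed by value, the way func3 below declares them:

void func(const double *left, const double *right, double *res,
          size_t size, double th, double drop);   // th and drop by value, not const double*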
AVX512F would let you do a !(right[i] >= thresh) compare and use the resulting mask for a merge-masked subtract. Elements where the predicate was true will get left[i] - drop; other elements will keep their left[i] value, because you merge into a vector of left values.
Unfortunately GCC with -march=skylake-avx512 uses a normal vsubpd and then a separate vmovapd zmm2{k1}, zmm5 to blend, which is obviously a missed optimization. The blend destination is already one of the inputs to the SUB.
Using AVX512VL for 256-bit vectors (in case the rest of your program can't efficiently use 512-bit, so you don't suffer reduced turbo clock speeds):
__m256d left = ...;
__m256d right = ...;
__mmask8 cmp = _mm256_cmp_pd_mask(right, set1(th), _CMP_NGE_UQ);
__m256d res = _mm256_mask_sub_pd (left, cmp, left, set1(drop));
So (besides the loads and store) it's 2 instructions with AVX512F / VL.
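Filled out into a whole loop, that might look something like this (my sketch, not from the original answer; it assumes AVX512F+AVX512VL and a size that's a multiple of 4):

#include <stddef.h>
#include <immintrin.h>

void func_avx512vl( const double *left, const double *right, double *res,
                    size_t size, double th, double drop )
{
    const __m256d thv = _mm256_set1_pd( th );
    const __m256d dropv = _mm256_set1_pd( drop );
    for( size_t i = 0; i < size; i += 4 )
    {
        __m256d l = _mm256_loadu_pd( left + i );
        __m256d r = _mm256_loadu_pd( right + i );
        // mask is set where !(right >= th), i.e. where we want to subtract
        __mmask8 cmp = _mm256_cmp_pd_mask( r, thv, _CMP_NGE_UQ );
        // merge-masked subtract: left - drop where the mask is set, left elsewhere
        __m256d out = _mm256_mask_sub_pd( l, cmp, l, dropv );
        _mm256_storeu_pd( res + i, out );
    }
}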
And the 0.0-or-drop approach described next is more efficient with all compilers, because you just need an AND, not a variable-blend. So it's significantly better with just SSE2, and also better on most CPUs even when they do support SSE4.1 blendvpd, because that instruction isn't as efficient.
You can subtract 0.0 or drop from left[i] based on the compare result. Producing 0.0 or a constant based on a compare result is extremely efficient: just an andps instruction. (The bit-pattern for 0.0 is all-zeros, and SIMD compares produce vectors of all-1 or all-0 bits. So AND keeps the old value or zeros it.)
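In intrinsics, the trick is roughly the following (an illustrative SSE2 helper; the name step_and_trick and its parameters are mine): thv and dropv would be _mm_set1_pd(th) and _mm_set1_pd(drop), l and r the current pair of left/right elements.

#include <emmintrin.h> // SSE2

// One 2-element step of the 0.0-or-drop trick (illustration only).
static inline __m128d step_and_trick( __m128d l, __m128d r, __m128d thv, __m128d dropv )
{
    __m128d ge  = _mm_cmple_pd( thv, r );      // all-ones where right >= th
    __m128d sub = _mm_andnot_pd( ge, dropv );  // drop where !(right >= th), else 0.0
    return _mm_sub_pd( l, sub );               // left - (drop or 0.0)
}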
We can also add -drop instead of subtracting drop. This costs an extra negation on input, but with AVX allows a memory-source operand for vaddpd. GCC chooses to use an indexed addressing mode so that doesn't actually help reduce the front-end uop count on Intel CPUs, though; it will "unlaminate". But even with -ffast-math, gcc doesn't do this optimization on its own to allow folding a load. (It wouldn't be worth doing separate pointer increments unless we unroll the loop, though.)
void func3(const double *__restrict left, const double *__restrict right, double *__restrict res,
           const size_t size, const double th, const double drop)
{
    for (size_t i = 0; i < size; ++i) {
        double add = right[i] >= th ? 0.0 : -drop;
        res[i] = left[i] + add;
    }
}
GCC 9.1's inner loop (with -march=skylake, but without -ffast-math) from the Godbolt link above:
# func3 main loop
# gcc -O3 -march=skylake   (without fast-math)
.L33:
        vcmplepd ymm2, ymm4, YMMWORD PTR [rsi+rax]
        vandnpd  ymm2, ymm2, ymm3
        vaddpd   ymm2, ymm2, YMMWORD PTR [rdi+rax]
        vmovupd  YMMWORD PTR [rdx+rax], ymm2
        add      rax, 32
        cmp      r8, rax
        jne      .L33
Or the plain SSE2 version has an inner loop that's basically the same, just with left - zero_or_drop instead of left + zero_or_minus_drop; so unless you can promise the compiler 16-byte alignment or you're making an AVX version, negating drop is just extra overhead.
Negating drop takes a constant from memory (to XOR the sign bit), and that's the only static constant this function needs, so that tradeoff is worth considering for your case where the loop doesn't run a huge number of times. (Unless th or drop are also compile-time constants after inlining, and are getting loaded anyway. Or especially if -drop can be computed at compile time. Or if you can get your program to work with a negative drop.)
Another difference between adding and subtracting is that subtracting doesn't destroy the sign of zero: -0.0 - 0.0 = -0.0 and +0.0 - 0.0 = +0.0. In case that matters.
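A quick scalar check of that point (my example, not from the answer):

#include <stdio.h>

int main(void)
{
    double left = -0.0;
    printf("%g\n", left - 0.0);  // prints -0 : subtracting +0.0 keeps the sign
    printf("%g\n", left + 0.0);  // prints 0  : adding +0.0 loses the negative zero
    return 0;
}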
# gcc9.1 -O3
.L26:
        movupd  xmm5, XMMWORD PTR [rsi+rax]
        movapd  xmm2, xmm4                    # duplicate th
        movupd  xmm6, XMMWORD PTR [rdi+rax]
        cmplepd xmm2, xmm5                    # destroy the copy of th
        andnpd  xmm2, xmm3                    # _mm_andnot_pd
        addpd   xmm2, xmm6                    # _mm_add_pd
        movups  XMMWORD PTR [rdx+rax], xmm2
        add     rax, 16
        cmp     r8, rax
        jne     .L26
GCC uses unaligned loads so (without AVX) it can't fold a memory source operand into cmppd or subpd.
You can use GCC's and Clang's vector extensions to implement a ternary select function (see https://stackoverflow.com/a/48538557/2542702).
#include <stddef.h>
#include <inttypes.h>

#if defined(__clang__)
typedef double double4 __attribute__ ((ext_vector_type(4)));
typedef int64_t long4 __attribute__ ((ext_vector_type(4)));
#else
typedef double double4 __attribute__ ((vector_size (sizeof(double)*4)));
typedef int64_t long4 __attribute__ ((vector_size (sizeof(int64_t)*4)));
#endif

double4 select(long4 s, double4 a, double4 b) {
    double4 c;
#if defined(__GNUC__) && !defined(__INTEL_COMPILER) && !defined(__clang__)
    c = s ? a : b;
#else
    for (int i = 0; i < 4; i++) c[i] = s[i] ? a[i] : b[i];
#endif
    return c;
}

void func(double* left, double* right, double* res, size_t size, double th, double drop) {
    size_t i;
    for (i = 0; i < (size & -4); i += 4) {
        double4 leftv = *(double4*)&left[i];
        double4 rightv = *(double4*)&right[i];
        *(double4*)&res[i] = select(rightv >= th, leftv, leftv - drop);
    }
    for (; i < size; i++) res[i] = right[i] >= th ? left[i] : (left[i] - drop);
}
https://godbolt.org/z/h4OKMl
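If you want to sanity-check it, a tiny driver like the following (mine, not part of the answer), appended after func, exercises both the vector loop and the scalar tail. The arrays are over-aligned because the double4 loads/stores above may require 32-byte alignment:

#include <stdio.h>

int main(void) {
    _Alignas(32) double left[6]  = { 1, 2, 3, 4, 5, 6 };
    _Alignas(32) double right[6] = { 0.5, 1.5, 2.5, 3.5, 4.5, 5.5 };
    _Alignas(32) double res[6];
    func(left, right, res, 6, 3.0, 10.0);
    for (int i = 0; i < 6; i++) printf("%g ", res[i]);  // expect: -9 -8 -7 4 5 6
    printf("\n");
    return 0;
}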