Question
Can someone recommend a fast way to do a saturating add of 32-bit signed integers using Intel intrinsics (AVX, SSE4, ...)?
I looked at the intrinsics guide and found _mm256_adds_epi16, but this seems to only add 16-bit ints. I don't see anything similar for 32 bits. The other calls seem to wrap around.
Answer 1:
A signed overflow happens if (and only if):
- the signs of both inputs are the same, and
- the sign of the sum (when added with wrap-around) differs from the sign of the inputs.
Using C operators: overflow = ~(a^b) & (a^(a+b)). Only the sign bit of this value is meaningful.
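As a scalar illustration of that condition, here is a minimal sketch in C. The function name sat_add_i32 is hypothetical and not part of the answer. The wrap-around add is done in unsigned arithmetic because signed overflow is undefined behavior in C, and the final conversion back to int32_t assumes the usual two's-complement behavior.

```c
#include <stdint.h>

/* Hypothetical helper: scalar saturating add of two 32-bit signed ints. */
static int32_t sat_add_i32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;                 /* wraps around, well defined */

    /* Sign bit is set iff the inputs share a sign and the sum's sign differs. */
    uint32_t overflow = ~(ua ^ ub) & (ua ^ sum);

    if (overflow >> 31) {
        /* Saturate toward the sign of the inputs. */
        return (a < 0) ? INT32_MIN : INT32_MAX;
    }
    return (int32_t)sum;                    /* no overflow: plain sum */
}
```

A scalar version like this can also serve as a reference when testing the vectorized variants.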
Also, if an overflow happens, the saturated result will have the same sign as either input. Using the int_min = int_max+1 trick suggested by @PeterCordes, and assuming you have at least SSE4.1 (for blendvps), this can be implemented as:
#include <smmintrin.h> // SSE4.1, for _mm_blendv_ps

__m128i __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    // normal result (possibly wraps around)
    __m128i res = _mm_add_epi32( a, b );

    // If result saturates, it has the same sign as both a and b
    __m128i sign_bit = _mm_srli_epi32(a, 31); // shift sign to lowest bit
    __m128i saturated = _mm_add_epi32(int_max, sign_bit);

    // saturation happened if inputs do not have different signs,
    // but sign of result is different:
    __m128i sign_xor = _mm_xor_si128( a, b );
    __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a,res));

    // blendvps takes the second operand where the mask's sign bit is set,
    // so select saturated on overflow and res otherwise
    return _mm_castps_si128(_mm_blendv_ps( _mm_castsi128_ps( res ),
                                           _mm_castsi128_ps( saturated ),
                                           _mm_castsi128_ps( overflow ) ) );
}
If your blendvps is as fast as (or faster than) a shift and an addition (also considering port usage), you can of course just blend int_min and int_max, using the sign bits of a.
Also, if you have only SSE2 or SSE3, you can replace the last blend with an arithmetic right shift of overflow by 31 bits, followed by manual blending (using and/andnot/or).
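The SSE2-only fallback described here might look like the following sketch. The function name adds_epi32_sse2 is hypothetical; the logic is the same overflow test as above, with the blend replaced by a shift and and/andnot/or.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical name: saturating add of four 32-bit signed ints, SSE2 only. */
static __m128i adds_epi32_sse2(__m128i a, __m128i b)
{
    const __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);

    /* wrapping sum */
    __m128i res = _mm_add_epi32(a, b);

    /* INT32_MAX where a >= 0, INT32_MAX + 1 == INT32_MIN where a < 0 */
    __m128i saturated = _mm_add_epi32(int_max, _mm_srli_epi32(a, 31));

    /* sign bit set where the inputs share a sign and the sum's sign differs */
    __m128i overflow = _mm_andnot_si128(_mm_xor_si128(a, b),
                                        _mm_xor_si128(a, res));

    /* arithmetic shift broadcasts the sign bit to all 32 bits of each lane */
    __m128i mask = _mm_srai_epi32(overflow, 31);

    /* manual blend: (mask & saturated) | (~mask & res) */
    return _mm_or_si128(_mm_and_si128(mask, saturated),
                        _mm_andnot_si128(mask, res));
}
```

This costs three extra logical operations and one shift compared with a single blendvps, so it is mainly of interest when SSE4.1 cannot be assumed.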
And naturally, with AVX2 this can take __m256i variables instead of __m128i (should be very easy to rewrite).
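A 256-bit rewrite could look roughly like the sketch below. The name adds_epi32_avx2 and the pointer-based signature are assumptions made for illustration, so that the example can be called from code that is not itself compiled for AVX2. The target attribute is a GCC/Clang extension and can be dropped when building with -mavx2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = saturating a[i] + b[i] for 8 lanes (AVX2). */
static __attribute__((target("avx2")))
void adds_epi32_avx2(const int32_t *a, const int32_t *b, int32_t *out)
{
    const __m256i int_max = _mm256_set1_epi32(0x7FFFFFFF);
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* wrapping sum */
    __m256i res = _mm256_add_epi32(va, vb);

    /* INT32_MAX where a >= 0, INT32_MIN where a < 0 */
    __m256i saturated = _mm256_add_epi32(int_max, _mm256_srli_epi32(va, 31));

    /* sign bit set where the inputs share a sign and the sum's sign differs */
    __m256i overflow = _mm256_andnot_si256(_mm256_xor_si256(va, vb),
                                           _mm256_xor_si256(va, res));

    /* take saturated where the mask's sign bit is set, res otherwise */
    __m256i sel = _mm256_castps_si256(_mm256_blendv_ps(
        _mm256_castsi256_ps(res),
        _mm256_castsi256_ps(saturated),
        _mm256_castsi256_ps(overflow)));

    _mm256_storeu_si256((__m256i *)out, sel);
}
```

The body is a line-for-line translation of the 128-bit version, since every operation used has a direct 256-bit counterpart that works independently per 32-bit lane.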
Addendum: If you know the sign of either a or b at compile time, you can directly set saturated accordingly, and you can save both _mm_xor_si128 calculations, i.e., overflow would be _mm_andnot_si128(b, res) for positive a and _mm_andnot_si128(res, b) for negative a (with res = a+b).
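To make the addendum concrete, here is a minimal sketch for the case where every lane of a is known to be non-negative. The function name adds_epi32_nonneg_a is hypothetical, and the final selection uses the SSE2 shift-and-mask blend as an assumption so the example builds without SSE4.1; a blendvps on overflow would work equally well.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: saturating a + b, valid only if all lanes of a are >= 0. */
static __m128i adds_epi32_nonneg_a(__m128i a, __m128i b)
{
    /* a >= 0, so the only possible saturation value is INT32_MAX */
    const __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);

    /* wrapping sum */
    __m128i res = _mm_add_epi32(a, b);

    /* overflow only if b is non-negative and the sum turned negative: ~b & res */
    __m128i overflow = _mm_andnot_si128(b, res);

    /* broadcast the sign bit, then blend manually */
    __m128i mask = _mm_srai_epi32(overflow, 31);
    return _mm_or_si128(_mm_and_si128(mask, int_max),
                        _mm_andnot_si128(mask, res));
}
```

Compared with the general version, this drops both xor operations and the shift-plus-add that computes saturated.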
Answer 2:
This link answers this very question:
https://software.intel.com/en-us/forums/topic/285219
Here's an example implementation:
#include <immintrin.h> // _mm_blendv_ps requires SSE4.1
#include <stddef.h>    // size_t
#include <stdint.h>    // int32_t

static inline __m128i __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_min = _mm_set1_epi32( (int)0x80000000 );
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    // wrapping sum
    __m128i res = _mm_add_epi32( a, b );

    __m128i sign_and = _mm_and_si128( a, b );
    __m128i sign_or = _mm_or_si128( a, b );

    // both inputs negative, result non-negative: clamp to int_min
    __m128i min_sat_mask = _mm_andnot_si128( res, sign_and );
    // both inputs non-negative, result negative: clamp to int_max
    __m128i max_sat_mask = _mm_andnot_si128( sign_or, res );

    __m128 res_temp = _mm_blendv_ps(_mm_castsi128_ps( res ),
                                    _mm_castsi128_ps( int_min ),
                                    _mm_castsi128_ps( min_sat_mask ) );

    return _mm_castps_si128(_mm_blendv_ps( res_temp,
                                           _mm_castsi128_ps( int_max ),
                                           _mm_castsi128_ps( max_sat_mask ) ) );
}

// Adds bufferB into bufferA with saturation, four samples at a time.
// The last numSamples % 4 samples are not processed.
void addSaturate(int32_t* bufferA, const int32_t* bufferB, size_t numSamples)
{
    __m128i* pSrc1 = (__m128i*)bufferA;
    const __m128i* pSrc2 = (const __m128i*)bufferB;

    for(size_t i=0; i<numSamples/4; ++i)
    {
        // unaligned load/store, so the buffers need not be 16-byte aligned
        __m128i res = __mm_adds_epi32(_mm_loadu_si128(pSrc1), _mm_loadu_si128(pSrc2));
        _mm_storeu_si128(pSrc1, res);

        pSrc1++;
        pSrc2++;
    }
}
Source: https://stackoverflow.com/questions/29498824/add-saturate-32-bit-signed-ints-intrinsics