Question
There are two ways to implement accumulation with SSE intrinsics, but one of them gives the wrong result.
#include <stdint.h>     // int32_t
#include <smmintrin.h>
int main(int argc, const char * argv[]) {
    int32_t A[4] = {10, 20, 30, 40};
    int32_t B[8] = {-1, 2, -3, -4, -5, -6, -7, -8};
    int32_t C[4] = {0, 0, 0, 0};
    int32_t D[4] = {0, 0, 0, 0};
    __m128i lv = _mm_load_si128((__m128i *)A);
    __m128i rv = _mm_load_si128((__m128i *)B);   // only the first 4 elements of B are loaded
    // way 1: the += operator, unexpected result
    rv += lv;
    _mm_store_si128((__m128i *)C, rv);
    // way 2: _mm_add_epi32, expected result
    rv = _mm_load_si128((__m128i *)B);
    rv = _mm_add_epi32(lv, rv);
    _mm_store_si128((__m128i *)D, rv);
    return 0;
}
The expected result is:
9 22 27 36
C is:
9 23 27 37
D is:
9 22 27 36
Answer 1:
In GNU C, __m128i is defined as a vector of 64-bit integers, with something like
typedef long long __m128i __attribute__((vector_size(16), may_alias));
Using GNU C native vector syntax (the + operator) does a per-element add with 64-bit element size, i.e. _mm_add_epi64.
In your case, the carry-out from the top of one 32-bit element added an extra one to the 32-bit element above it, because a 64-bit element size propagates carry between the two 32-bit halves of each element. (Adding a negative number to a positive one of equal or larger magnitude, such as 10 + -1, wraps around as an unsigned add and so produces a carry-out.)
The Intel intrinsics API doesn't define the + operator for __m128 / __m128d / __m128i. Your code won't compile on MSVC, for example.
So the behaviour you're getting is only from the implementation details of intrinsic types in GCC's headers. It's useful for float vectors where there is an obvious element size, but for integer vectors you'd want to define your own unless you do happen to have 64-bit integers.
If you want to be able to use v1 += v2; you can define your own GNU C native vector types, like
typedef uint32_t v4ui __attribute__((vector_size(16), aligned(4)));
Note I left out the may_alias, so it's only safe to cast pointers to unsigned, not to read arbitrary data like char[].
In fact GCC's emmintrin.h (SSE2) does define a bunch of types:
/* SSE2 */
typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef long long __v2di __attribute__ ((__vector_size__ (16)));
typedef unsigned long long __v2du __attribute__ ((__vector_size__ (16)));
typedef int __v4si __attribute__ ((__vector_size__ (16)));
typedef unsigned int __v4su __attribute__ ((__vector_size__ (16)));
typedef short __v8hi __attribute__ ((__vector_size__ (16)));
typedef unsigned short __v8hu __attribute__ ((__vector_size__ (16)));
typedef char __v16qi __attribute__ ((__vector_size__ (16)));
typedef unsigned char __v16qu __attribute__ ((__vector_size__ (16)));
I'm not sure if they're intended for external use.
GNU C native vectors are most useful when you want to get the compiler to emit efficient code for division by a compile-time constant, or something like that. e.g. digit = v1 % 10; and v1 /= 10; with 16-bit unsigned integers will compile to pmulhuw and a right shift. But they're also just handy for readable code.
There are some C++ wrapper libraries that portably provide operator overloads, and have types like Vec4i (4x signed int) / Vec4u (4x unsigned int) / Vec16c (16x signed char) to give you a type system for different kinds of integer vectors, so you know what you're getting from v1 += v2; or v1 >>= 2; (Right shifts are one case where the signedness matters.)
e.g. Agner Fog's VCL (GPL license), or DirectXMath (MIT license).
Source: https://stackoverflow.com/questions/56572357/why-does-gives-me-unexpected-result-in-sse-instrinsic