Question
GCC's vector extensions offer a nice, reasonably portable way of accessing some SIMD instructions on different hardware architectures without resorting to hardware-specific intrinsics (or auto-vectorization).
A real use case is calculating a simple additive checksum. The one thing that isn't clear is how to safely load data into a vector.
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef char v16qi __attribute__ ((vector_size(16)));

static uint8_t checksum(uint8_t *buf, size_t size)
{
    assert(size%16 == 0);
    uint8_t sum = 0;
    v16qi vec = {0};
    for (size_t i=0; i<(size/16); i++)
    {
        // XXX: Yuck! Is there a better way?
        // Parenthesized so the offset is applied in bytes before the cast.
        vec += *((v16qi*) (buf+i*16));
    }
    // Sum up the vector
    sum = vec[0] + vec[1] + vec[2] + vec[3] + vec[4] + vec[5] + vec[6] + vec[7] + vec[8] + vec[9] + vec[10] + vec[11] + vec[12] + vec[13] + vec[14] + vec[15];
    return sum;
}
Casting a pointer to the vector type appears to work, but I'm worried this might explode in a horrible fashion if SIMD hardware expects the vector types to be correctly aligned.
The only other option I've thought of is to use a temporary vector and explicitly load the values (via either a memcpy or element-wise assignment), but in testing this counteracts most of the speedup gained from using SIMD instructions. Ideally I'd imagine this would be something like a generic __builtin_load() function, but none seems to exist.
What's a safer way of loading data into a vector without risking alignment issues?
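For reference, the memcpy variant mentioned above can be sketched as follows. This is only an illustration, not the code that was benchmarked: the type name v16u8 and the function name checksum_memcpy are made up for the example, and an unsigned element type is assumed so that lane overflow wraps modulo 256. Recent GCC and Clang builds with optimization enabled typically lower a fixed-size 16-byte memcpy to a single unaligned vector load, but that is worth confirming in the generated assembly for the target in question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: 16 unsigned bytes in one vector. */
typedef uint8_t v16u8 __attribute__ ((vector_size(16)));

/* Additive checksum over `size` bytes; `buf` may have any alignment. */
static uint8_t checksum_memcpy(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    v16u8 acc = {0};
    size_t i = 0;

    /* Whole 16-byte blocks: memcpy has no alignment or aliasing requirements. */
    for (; i + 16 <= size; i += 16) {
        v16u8 chunk;
        memcpy(&chunk, buf + i, sizeof chunk);
        acc += chunk;               /* lane-wise add, wraps modulo 256 */
    }

    /* Horizontal sum of the 16 lanes. */
    uint8_t sum = 0;
    for (int lane = 0; lane < 16; lane++)
        sum += acc[lane];

    /* Remaining tail bytes when size is not a multiple of 16. */
    for (; i < size; i++)
        sum += buf[i];

    return sum;
}
```

Unlike the original function, this sketch also accepts sizes that are not a multiple of 16 by summing the tail byte by byte.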
Answer 1:
You could use an initializer to load the values, i.e. do
const v16qi e = { buf[0], buf[1], ... , buf[15] };
and hope that GCC turns this into an SSE load instruction. I'd verify that with a disassembler, though ;-). Also, for better performance, try to make buf 16-byte aligned, and inform the compiler via an aligned attribute. If you can't guarantee that the input buffer will be aligned, process it bytewise until you've reached a 16-byte boundary.
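The "process bytewise until a 16-byte boundary" idea could look roughly like the sketch below. The names v16u8 and checksum_aligned_body are hypothetical, the may_alias attribute is added out of caution for the pointer conversion, and __builtin_assume_aligned (a GCC and Clang builtin) stands in for the aligned attribute as the way of telling the compiler about the alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; may_alias added so loads through this type are safe. */
typedef uint8_t v16u8 __attribute__ ((vector_size(16), may_alias));

static uint8_t checksum_aligned_body(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    uint8_t sum = 0;
    size_t i = 0;

    /* Scalar head: advance until buf + i sits on a 16-byte boundary. */
    while (i < size && ((uintptr_t)(buf + i) % 16) != 0)
        sum += buf[i++];

    /* Aligned body: the pointer is now 16-byte aligned, so say so. */
    v16u8 acc = {0};
    for (; i + 16 <= size; i += 16) {
        const v16u8 *p = __builtin_assume_aligned(buf + i, 16);
        acc += *p;                  /* lane-wise add, wraps modulo 256 */
    }

    /* Horizontal sum of the 16 lanes. */
    for (int lane = 0; lane < 16; lane++)
        sum += acc[lane];

    /* Scalar tail: whatever is left after the last full block. */
    for (; i < size; i++)
        sum += buf[i];

    return sum;
}
```

Because byte-wise addition modulo 256 is commutative, the order in which head, body, and tail are summed does not change the result.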
Answer 2:
Edit (thanks, Peter Cordes): You can cast pointers:
typedef char v16qi __attribute__ ((vector_size (16), aligned (16)));
v16qi vec = *(v16qi*)&buf[i]; // load
*(v16qi*)(buf + i) = vec; // store whole vector
This compiles to vmovdqa to load and vmovups to store. If the data isn't known to be aligned, set aligned (1) to generate vmovdqu. (godbolt)
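Applied to the checksum from the question, the aligned (1) variant might look like this minimal sketch. The names v16u8_u and checksum_unaligned are invented for the example, an unsigned element type is assumed so that overflow wraps, and may_alias is added as a precaution for the pointer cast. Which instructions come out depends on the compiler version and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* aligned(1): no alignment is assumed for loads through this type.
   may_alias: extra caution for the pointer cast. Hypothetical name. */
typedef uint8_t v16u8_u __attribute__ ((vector_size(16), aligned(1), may_alias));

static uint8_t checksum_unaligned(const uint8_t *buf, size_t size)
{
    assert(size % 16 == 0);

    v16u8_u acc = {0};
    for (size_t i = 0; i < size; i += 16)
        acc += *(const v16u8_u *)(buf + i);   /* unaligned 16-byte load */

    /* Horizontal sum of the 16 lanes. */
    uint8_t sum = 0;
    for (int lane = 0; lane < 16; lane++)
        sum += acc[lane];
    return sum;
}
```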
Note that there are also several special-purpose builtins for loading and unloading these registers (Edit 2):
v16qi vec = _mm_loadu_si128((__m128i*)&buf[i]); // _mm_load_si128 for aligned
_mm_storeu_si128((__m128i*)&buf[i], vec); // _mm_store_si128 for aligned
It seems to be necessary to use -flax-vector-conversions to go from chars to v16qi with this function.
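One way to sidestep the conversion flag is to stay in __m128i for the whole computation, as in the sketch below. It assumes an x86 target with SSE2; the function name checksum_sse2 and the scalar fallback branch are additions for illustration so that the example still builds elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_add_epi8, _mm_storeu_si128 */
#endif

static uint8_t checksum_sse2(const uint8_t *buf, size_t size)
{
    assert(size % 16 == 0);
    uint8_t sum = 0;

#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < size; i += 16) {
        /* Unaligned load: buf + i may have any alignment. */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        acc = _mm_add_epi8(acc, chunk);       /* 16 byte-wise adds, wrapping */
    }

    /* Spill the accumulator to memory and add up the lanes. */
    uint8_t lanes[16];
    _mm_storeu_si128((__m128i *)lanes, acc);
    for (int lane = 0; lane < 16; lane++)
        sum += lanes[lane];
#else
    /* Fallback so the sketch still builds on targets without SSE2. */
    for (size_t i = 0; i < size; i++)
        sum += buf[i];
#endif

    return sum;
}
```

Since no value is ever converted between __m128i and a user-defined vector type here, -flax-vector-conversions should not be needed for this version.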
See also: C - How to access elements of vector using GCC SSE vector extension
See also: SSE loading ints into __m128
(Tip: The best phrase to google is something like "gcc loading __m128i".)
Source: https://stackoverflow.com/questions/9318115/loading-data-for-gccs-vector-extensions