Question
I have been using the following "trick" in C code with SSE2 for single-precision floats for a while now:
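static inline __m128 SSEI_m128shift(__m128 data)
{
    /* Reinterpret as integers, byte-shift right by 4 bytes (one float), reinterpret back. */
    return _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(data), 4));
}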
For data like [1.0, 2.0, 3.0, 4.0], it results in [2.0, 3.0, 4.0, 0.0], i.e. it shifts every element one position to the left and fills the vacated last slot with a zero. If I remember correctly, the above inline function compiles down to a single instruction (with gcc at least).
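To make the behaviour described above concrete, here is a minimal sketch that applies the helper to four floats in memory. The wrapper name `shift_floats4` is made up for illustration, and the cast back to `__m128` uses `_mm_castsi128_ps` rather than a C-style cast so that it does not depend on a gcc extension.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Same trick as above: a 4-byte right byte-shift of the 128-bit register
   moves each float one slot toward index 0 and zero-fills the top slot. */
static inline __m128 SSEI_m128shift(__m128 data)
{
    return _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(data), 4));
}

/* Hypothetical wrapper (illustration only): load four floats from `in`,
   shift them, and store the result to `out`. Unaligned access is used so
   the arrays need no special alignment. */
static void shift_floats4(const float in[4], float out[4])
{
    _mm_storeu_ps(out, SSEI_m128shift(_mm_loadu_ps(in)));
}
```

Calling `shift_floats4` on `{1.0f, 2.0f, 3.0f, 4.0f}` leaves `{2.0f, 3.0f, 4.0f, 0.0f}` in `out`, which matches the example above.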
I am somehow failing to wrap my head around doing the same with AVX2. How could I achieve this in an efficient manner?
Similar questions: 1, 2, 3
Source: https://stackoverflow.com/questions/61971583/left-shift-of-float32-array-with-avx2-and-filling-up-with-a-zero