I have a __m128i variable and I need to shift its 128-bit value by n bits, i.e. like _mm_srli_si128 and _mm_slli_si128 work, but on bits instead of bytes.
This is the best that I could come up with for left/right immediate shifts with SSE2:
#include <stdio.h>
#include <emmintrin.h>
#define SHL128(v, n) \
({ \
    __m128i v1, v2; \
 \
    if ((n) >= 64) \
    { \
        /* move the low qword into the high half, then shift the rest */ \
        v1 = _mm_slli_si128((v), 8); \
        v1 = _mm_slli_epi64(v1, (n) - 64); \
    } \
    else \
    { \
        /* shift each 64-bit lane, then OR in the bits carried across */ \
        v1 = _mm_slli_epi64((v), (n)); \
        v2 = _mm_slli_si128((v), 8); \
        v2 = _mm_srli_epi64(v2, 64 - (n)); \
        v1 = _mm_or_si128(v1, v2); \
    } \
    v1; \
})
#define SHR128(v, n) \
({ \
    __m128i v1, v2; \
 \
    if ((n) >= 64) \
    { \
        /* move the high qword into the low half, then shift the rest */ \
        v1 = _mm_srli_si128((v), 8); \
        v1 = _mm_srli_epi64(v1, (n) - 64); \
    } \
    else \
    { \
        /* shift each 64-bit lane, then OR in the bits carried across */ \
        v1 = _mm_srli_epi64((v), (n)); \
        v2 = _mm_srli_si128((v), 8); \
        v2 = _mm_slli_epi64(v2, 64 - (n)); \
        v1 = _mm_or_si128(v1, v2); \
    } \
    v1; \
})
int main(void)
{
    __m128i va = _mm_setr_epi8(0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
                               0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f);
    __m128i vb, vc;

    /* n < 64 case */
    vb = SHL128(va, 4);
    vc = SHR128(va, 4);
    printf("va = %02vx\n", va);
    printf("vb = %02vx\n", vb);
    printf("vc = %02vx\n", vc);
    printf("\n");

    /* n >= 64 case */
    vb = SHL128(va, 68);
    vc = SHR128(va, 68);
    printf("va = %02vx\n", va);
    printf("vb = %02vx\n", vb);
    printf("vc = %02vx\n", vc);
    return 0;
}
Test:
$ gcc -Wall -msse2 shift128.c && ./a.out
va = 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
vb = 00 10 20 30 40 50 60 70 80 90 a0 b0 c0 d0 e0 f0
vc = 10 20 30 40 50 60 70 80 90 a0 b0 c0 d0 e0 f0 00

va = 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
vb = 00 00 00 00 00 00 00 00 00 10 20 30 40 50 60 70
vc = 90 a0 b0 c0 d0 e0 f0 00 00 00 00 00 00 00 00 00
$
Note that the SHL128/SHR128 macros are implemented using the GNU statement-expression extension (the `({ ... })` syntax), which is supported by gcc, clang and some other compilers; the macros will need to be adapted if your compiler does not support this extension.
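For compilers without that extension, one possible adaptation is a pair of plain inline functions. The sketch below is not part of the original macros: the names shl128 and shr128 are hypothetical, and it swaps in the register-count forms _mm_sll_epi64/_mm_srl_epi64 (with the count loaded via _mm_cvtsi32_si128) so that n is an ordinary function argument. It assumes 0 <= n <= 127.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: 128-bit left shift by n bits, assuming 0 <= n <= 127. */
static inline __m128i shl128(__m128i v, int n)
{
    __m128i moved = _mm_slli_si128(v, 8);  /* low qword moved into the high half */
    if (n >= 64)
        return _mm_sll_epi64(moved, _mm_cvtsi32_si128(n - 64));
    /* bits that stay inside their own 64-bit lane */
    __m128i within = _mm_sll_epi64(v, _mm_cvtsi32_si128(n));
    /* bits carried from the low lane into the high lane (a count of 64 yields 0) */
    __m128i carry = _mm_srl_epi64(moved, _mm_cvtsi32_si128(64 - n));
    return _mm_or_si128(within, carry);
}

/* Hypothetical helper: 128-bit logical right shift by n bits, assuming 0 <= n <= 127. */
static inline __m128i shr128(__m128i v, int n)
{
    __m128i moved = _mm_srli_si128(v, 8);  /* high qword moved into the low half */
    if (n >= 64)
        return _mm_srl_epi64(moved, _mm_cvtsi32_si128(n - 64));
    __m128i within = _mm_srl_epi64(v, _mm_cvtsi32_si128(n));
    /* bits carried from the high lane into the low lane */
    __m128i carry = _mm_sll_epi64(moved, _mm_cvtsi32_si128(64 - n));
    return _mm_or_si128(within, carry);
}
```

Usage mirrors the macros, e.g. `vb = shl128(va, 4);`. The byte shift still uses the constant 8, so only the 64-bit lane shifts take a run-time count.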
Note also that the printf extension for SIMD types (the `%vx` conversion) used in the test harness works with Apple gcc, clang, et al., but again, if your compiler does not support this and you want to test the code, you'll need to implement your own SIMD print routines.
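A minimal sketch of such a print routine, assuming only SSE2 and standard C: it stores the vector to a byte array and prints the bytes lowest address first, which is the layout shown in the test output. The names format_m128i and print_m128i are hypothetical.

```c
#include <stdio.h>
#include <emmintrin.h>

/* Hypothetical helper: write the 16 bytes of v as space-separated hex,
   lowest address first, into buf (which must hold at least 48 chars). */
static void format_m128i(char *buf, __m128i v)
{
    unsigned char b[16];
    _mm_storeu_si128((__m128i *)b, v);  /* unaligned store is safe for any buffer */
    for (int i = 0; i < 16; i++)
        sprintf(buf + 3 * i, i < 15 ? "%02x " : "%02x", b[i]);
}

/* Hypothetical helper: print "name = xx xx ... xx" followed by a newline. */
static void print_m128i(const char *name, __m128i v)
{
    char buf[48];
    format_m128i(buf, v);
    printf("%s = %s\n", name, buf);
}
```

With this in place, a call such as `print_m128i("va", va);` can stand in for `printf("va = %02vx\n", va);` in the test harness.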
Note on performance: the if/else branch will be optimised out as long as n is a compile-time constant (which it needs to be anyway for the shift intrinsics), so you have 2 instructions for the n >= 64 case and 4 instructions for the n < 64 case.
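The two cases are easier to see in a scalar model that treats the 128-bit value as two 64-bit halves. The sketch below is only an illustration for checking results, not the SSE2 code itself; the u128 struct and the _ref function names are hypothetical, and it assumes 0 <= n <= 127. Unlike the SSE2 lane shifts, a scalar C shift by 64 is undefined, so n == 0 is handled separately.

```c
#include <stdint.h>

/* Hypothetical two-halves model of a 128-bit value. */
typedef struct { uint64_t lo, hi; } u128;

/* Reference left shift by n bits, assuming 0 <= n <= 127. */
static u128 shl128_ref(u128 x, unsigned n)
{
    u128 r;
    if (n >= 64) {            /* whole low half moves up: one move plus one shift */
        r.hi = x.lo << (n - 64);
        r.lo = 0;
    } else if (n == 0) {      /* avoid the undefined scalar shift by 64 below */
        r = x;
    } else {                  /* lane shift, plus the bits carried from lo into hi */
        r.hi = (x.hi << n) | (x.lo >> (64 - n));
        r.lo = x.lo << n;
    }
    return r;
}

/* Reference logical right shift by n bits, assuming 0 <= n <= 127. */
static u128 shr128_ref(u128 x, unsigned n)
{
    u128 r;
    if (n >= 64) {            /* whole high half moves down */
        r.lo = x.hi >> (n - 64);
        r.hi = 0;
    } else if (n == 0) {
        r = x;
    } else {                  /* lane shift, plus the bits carried from hi into lo */
        r.lo = (x.lo >> n) | (x.hi << (64 - n));
        r.hi = x.hi >> n;
    }
    return r;
}
```

For the test vector above, the low half is 0x0706050403020100 and the high half is 0x0f0e0d0c0b0a0908 (bytes are stored little-endian), so the model's results can be compared byte for byte against the vb and vc lines.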