Translating SSE to Neon: How to pack and then extract a 32-bit result

I have to translate the following instructions from SSE to Neon

 uint32_t a = _mm_cvtsi128_si32(_mm_shuffle_epi8(a,SHUFFLE_MASK) );

Where:

2 Answers

鱼传尺愫

2021-01-18 19:37

I would write it like this:

uint32_t extract (uint8x16_t x)
{
  uint8x8x2_t a = vuzp_u8 (vget_low_u8 (x), vget_high_u8 (x));
  uint8x8x2_t b = vuzp_u8 (a.val[0], a.val[1]);
  return vget_lane_u32 (vreinterpret_u32_u8 (b.val[0]), 0);
}

With a recent GCC version, this compiles to:

extract:
    vuzp.8  d0, d1
    vuzp.8  d0, d1
    vmov.32 r0, d0[0]
    bx  lr
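
To check the byte movement without ARM hardware, the two vuzp_u8 passes can be modelled in plain C. This is only a sketch: u8x8x2_model, vuzp_u8_model and extract_model are made-up names for illustration, not NEON intrinsics. The final memcpy reads the first four bytes in memory order, which should match what vget_lane_u32 returns on a little-endian target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-in for uint8x8x2_t. */
typedef struct { uint8_t val[2][8]; } u8x8x2_model;

/* Models vuzp_u8(a, b): val[0] gets the even-indexed bytes of a, then of b;
   val[1] gets the odd-indexed bytes of a, then of b. */
static u8x8x2_model vuzp_u8_model(const uint8_t a[8], const uint8_t b[8])
{
    u8x8x2_model r;
    for (size_t i = 0; i < 4; ++i) {
        r.val[0][i]     = a[2 * i];
        r.val[0][i + 4] = b[2 * i];
        r.val[1][i]     = a[2 * i + 1];
        r.val[1][i + 4] = b[2 * i + 1];
    }
    return r;
}

/* Mirrors extract(): two unzip passes leave bytes 0, 4, 8 and 12 of x
   in the first four bytes of b.val[0]. */
static uint32_t extract_model(const uint8_t x[16])
{
    assert(x != NULL);
    u8x8x2_model a = vuzp_u8_model(x, x + 8);
    u8x8x2_model b = vuzp_u8_model(a.val[0], a.val[1]);
    uint32_t out;
    memcpy(&out, b.val[0], sizeof out); /* first 32-bit lane, in memory byte order */
    return out;
}
```

Feeding it the bytes 0..15 should leave 0, 4, 8, 12 in the four bytes of the result, which makes it easy to compare against the NEON version on a device.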

鱼传尺愫

2021-01-18 19:42

I found this excellent guide and am working through it. It seems that my operation could be done with a single VTBL (table lookup) instruction, but for now I will implement it with two deinterleaving operations because that looks simpler.

uint8x8x2_t   vuzp_u8(uint8x8_t a, uint8x8_t b);

So something like:

uint8x16_t a;
uint8_t* out;
[...]

//a = 138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0

a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 0 140 0 146 0 147 0 0 0 0 0 0 0 0 0

a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 140 146 147 0 0 0 0 0 0 0 0 0 0 0 0

vst1q_lane_u32(out,a,0);

The last call does not produce a warning when compiled with __attribute__((optimize("lax-vector-conversions")))

However, the two assignments do not compile, because vuzp_u8 returns a uint8x8x2_t while a is a uint8x16_t. One workaround looks like this (Edit: This breaks strict aliasing rules! The compiler may assume that a does not change when a value is stored through d.):

uint8x8x2_t* d = (uint8x8x2_t*) &a;
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);

I have implemented a more general workaround through a flexible data type:

NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
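
For comparison, the type mismatch can probably also be resolved without a pointer cast or a wrapper type: vcombine_u8 joins the two halves returned by vuzp_u8 back into a uint8x16_t, and vreinterpretq_u32_u8 does the final conversion explicitly. The sketch below assumes a little-endian target; pack_every_fourth is a hypothetical helper name, and the scalar branch is only there so the same function builds on a non-ARM host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Hypothetical helper: packs bytes 0, 4, 8 and 12 of src into one 32-bit word. */
static uint32_t pack_every_fourth(const uint8_t src[16])
{
    assert(src != NULL);
    uint8x16_t a = vld1q_u8(src);

    /* vuzp_u8 returns a pair of 64-bit vectors; vcombine_u8 rejoins them,
       so no aliasing pointer cast is needed. */
    uint8x8x2_t t = vuzp_u8(vget_low_u8(a), vget_high_u8(a));
    a = vcombine_u8(t.val[0], t.val[1]);
    t = vuzp_u8(vget_low_u8(a), vget_high_u8(a));
    a = vcombine_u8(t.val[0], t.val[1]);

    uint32_t out;
    vst1q_lane_u32(&out, vreinterpretq_u32_u8(a), 0); /* store lane 0 only */
    return out;
}
#else
/* Scalar fallback for hosts without NEON: same result in memory byte order. */
static uint32_t pack_every_fourth(const uint8_t src[16])
{
    assert(src != NULL);
    const uint8_t packed[4] = { src[0], src[4], src[8], src[12] };
    uint32_t out;
    memcpy(&out, packed, sizeof out);
    return out;
}
#endif
```

With the input 138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0 from above, the four bytes of the returned word should be 138 140 146 147.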

Edit:

Here is the version with a shuffle mask (lookup table). It does indeed make my inner loop a little faster. Again, I have used the data type described here.

static const uint8x8_t MASK = {0x00,0x04,0x08,0x0C,0xff,0xff,0xff,0xff};
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
NeonVectorType<uint8x8_t> res; //res can be used as uint8x8_t, uint32x2_t, etc.
[...]
res = vtbl2_u8(a, MASK);
vst1_lane_u32(out,res,0);
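
To see why the 0xff entries in MASK are harmless, here is a rough scalar model of the lookup. As far as I understand vtbl2_u8, each index selects a byte from the 16-byte table and any index of 16 or more yields 0. vtbl2_u8_model is a made-up name for this sketch, not an intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vtbl2_u8: out[i] = table[idx[i]],
   or 0 when idx[i] is out of range (>= 16). */
static void vtbl2_u8_model(const uint8_t table[16], const uint8_t idx[8],
                           uint8_t out[8])
{
    assert(table != NULL && idx != NULL && out != NULL);
    for (size_t i = 0; i < 8; ++i) {
        out[i] = (idx[i] < 16) ? table[idx[i]] : 0;
    }
}
```

With the mask 0x00, 0x04, 0x08, 0x0C followed by four 0xff entries, the first four output bytes are table bytes 0, 4, 8 and 12, and the remaining four are zero, so storing lane 0 as a 32-bit word gives the packed result.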
