Question
I'm trying to implement the strategy described in an answer to "How do I vectorize data_i16[0 to 15]?". The code is below. The spot I'd like to fix is the inner for(int i=0; i<ALIGN; i++) loop.
I'm new to SIMD. From what I can tell, I'd load the high/low nibble tables by writing:
const auto HI_TBL = _mm_load_si128((const __m128i*)HighNibble);
const auto LO_TBL = _mm_load_si128((const __m128i*)LowNibble);
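One detail about the two loads: each table holds sixteen u16 entries, which is 32 bytes, so a single _mm_load_si128 covers only the first eight entries. A possible workaround, shown here only as a sketch, is to split each table into a low-byte plane and a high-byte plane of 16 bytes each, so that each plane fits in one register. The helper name split_byte_planes is hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: split an n-entry u16 table into two byte tables,
// lo[i] = low byte of table[i], hi[i] = high byte of table[i].
// With n == 16, each plane is exactly 16 bytes, i.e. one 128-bit register.
static void split_byte_planes(const uint16_t* table, size_t n,
                              uint8_t* lo, uint8_t* hi)
{
    assert(table && lo && hi);
    for (size_t i = 0; i < n; i++) {
        lo[i] = static_cast<uint8_t>(table[i] & 0xFF);
        hi[i] = static_cast<uint8_t>(table[i] >> 8);
    }
}
```

For the HighNibble table this gives a low plane that is 1 only at index 3 and a high plane that is 2 (512 >> 8) at indices 4 and 6.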
My problem is the >>4 and tbl[index].
It seems like I can't do a shift on bytes (the narrowest I found is _mm_srai_epi16), so I need to convert everything to 16 bits. OK, fine: I can use two unpacks (_mm_unpacklo_epi8/_mm_unpackhi_epi8) with zeroes as the second parameter, and I'll have two sets of variables to shift. However, the shuffle seems to be available only for bytes (_mm_shuffle_epi8), and it moves only 8-bit elements while my table entries are 16 bits.
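A common way around the missing byte shift, sketched here under the assumption that only the nibble values are needed, is to shift the 16-bit lanes logically with _mm_srli_epi16 and then mask every byte with 0x0F. The mask discards the bits that leak in from the neighbouring byte, so no unpack is required. The helper names high_nibbles, low_nibbles, and split_nibbles are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2

// Per-byte (b >> 4) for 16 bytes at once.
// The 16-bit shift moves the low 4 bits of each upper byte into the top of
// the lower byte; the 0x0F mask removes them again.
static inline __m128i high_nibbles(__m128i bytes)
{
    return _mm_and_si128(_mm_srli_epi16(bytes, 4), _mm_set1_epi8(0x0F));
}

// Per-byte (b & 15) for 16 bytes at once.
static inline __m128i low_nibbles(__m128i bytes)
{
    return _mm_and_si128(bytes, _mm_set1_epi8(0x0F));
}

// Hypothetical wrapper: reads 16 bytes from in, writes 16 high nibbles to hi
// and 16 low nibbles to lo. Unaligned loads/stores keep it easy to call.
static inline void split_nibbles(const uint8_t* in, uint8_t* hi, uint8_t* lo)
{
    assert(in && hi && lo);
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(hi), high_nibbles(v));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lo), low_nibbles(v));
}
```

Both results are already byte vectors with values in 0..15, which is the index format _mm_shuffle_epi8 expects.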
As you can see, I'll need a lot of instructions, so I get the feeling I'm doing this wrong. I'm also unsure how to go from 16 bits (after I right-shift by 4) back to 8. Maybe I missed it, but is there a 128-bit rotate right? Then I could skip the unpack (using vectorOf15=_mm_set1_epi8(15) and _mm_and_si128(rotateResult, vectorOf15)?).
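One possible shape for the whole lookup is sketched below. It is not necessarily what the linked answer intends. It assumes the 16-bit table entries are stored as separate low-byte and high-byte planes, which is valid here because the final operation is a bitwise AND and can therefore be done per byte. Each plane is looked up with _mm_shuffle_epi8, and the two byte results are interleaved back into u16 values with _mm_unpacklo_epi8/_mm_unpackhi_epi8. The names classify16 and classify are hypothetical, and the target("ssse3") attribute is a GCC/Clang assumption that lets the function compile without -mssse3.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8

// out[i] = HighNibble[in[i] >> 4] & LowNibble[in[i] & 15] for 16 bytes.
static __attribute__((target("ssse3")))
void classify16(const uint8_t* in, uint16_t* out)
{
    // Byte planes of the question's tables (512 has high byte 2, 513 has low byte 1).
    const __m128i hn_lo = _mm_setr_epi8(0,0,0,1, 0,0,0,0, 0,0,0,0, 0,0,0,0);
    const __m128i hn_hi = _mm_setr_epi8(0,0,0,0, 2,0,2,0, 0,0,0,0, 0,0,0,0);
    const __m128i ln_lo = _mm_setr_epi8(1,1,1,1, 1,1,1,1, 1,1,0,0, 0,0,0,0);
    const __m128i ln_hi = _mm_setr_epi8(0,2,2,2, 2,2,2,0, 0,0,0,0, 0,0,0,0);

    const __m128i mask = _mm_set1_epi8(0x0F);
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    const __m128i lo_idx = _mm_and_si128(v, mask);                    // b & 15
    const __m128i hi_idx = _mm_and_si128(_mm_srli_epi16(v, 4), mask); // b >> 4

    // Indices are 0..15 with the top bit clear, so pshufb never zeroes a lane.
    const __m128i res_lo = _mm_and_si128(_mm_shuffle_epi8(hn_lo, hi_idx),
                                         _mm_shuffle_epi8(ln_lo, lo_idx));
    const __m128i res_hi = _mm_and_si128(_mm_shuffle_epi8(hn_hi, hi_idx),
                                         _mm_shuffle_epi8(ln_hi, lo_idx));

    // Interleave low and high bytes into little-endian u16 results.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out),
                     _mm_unpacklo_epi8(res_lo, res_hi)); // elements 0..7
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 8),
                     _mm_unpackhi_epi8(res_lo, res_hi)); // elements 8..15
}

// Hypothetical driver: n must be a multiple of 16.
static void classify(const uint8_t* in, uint16_t* out, size_t n)
{
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16)
        classify16(in + i, out + i);
}
```

This costs four shuffles, two shifts/masks, two ANDs, and two unpacks per 16 input bytes, with no widening of the input. The unaligned load and store intrinsics are used only to keep the sketch easy to call; with the aligned buffers from the demo, the aligned variants would work as well.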
Here's a non-vectorized demo:
#include <stdio.h>
#include <string.h>

typedef unsigned char u8;
typedef unsigned short u16;
typedef signed char s8;
typedef signed short s16;

#define ALIGN 16
#define ALIGN_ATTR __attribute__ ((aligned(ALIGN)))

// Lookup tables indexed by the high and low nibble of each input byte.
u16 HighNibble[16] ALIGN_ATTR = {0, 0, 0, 1, 512, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0};
u16 LowNibble[16] ALIGN_ATTR = {1, 513, 513, 513, 513, 513, 513, 1, 1, 1, 0, 0, 0, 0, 0, 0};

char my_input[1024*1024] ALIGN_ATTR;
u16 my_output[1024*1024] ALIGN_ATTR;

int main(int argc, char *argv[])
{
    strcpy(my_input, "09AZaz.fFgG"); // Digits will become 1 and A-F/a-f will become 512
    auto input_end = my_input + sizeof(my_input);
    auto output = my_output;
    for (auto input = my_input; input < input_end; input += ALIGN)
    {
        for (int i = 0; i < ALIGN; i++)
        {
            u8 val = (u8)input[i]; // unsigned, so val>>4 is always an index in 0..15
            output[i] = HighNibble[val>>4] & LowNibble[val&15];
        }
        output += ALIGN;
    }
    for (int i = 0; i < 11; i++) // We only care about the first few we set using strcpy
        printf("%d\n", my_output[i]);
    return 0;
}
Source: https://stackoverflow.com/questions/61446596/how-do-i-efficiently-lookup-16bits-in-a-128bit-simd-vector