Question
I'm on the Intel Intrinsic site and I can't figure out what combination of instructions I want. What I'd like to do is
result = high_table[i8>>4] & low_table[i8&15]
Where both tables are 16 bits wide (or more). shuffle seems like what I want (_mm_shuffle_epi8), however getting an 8-bit value doesn't work for me. There doesn't seem to be a 16-bit version, and the non-byte version seems to need the second param as an immediate value.
How am I supposed to implement this? Do I call _mm_shuffle_epi8 twice for each table, cast it to 16 bits and shift the value by 8? If so, which cast and shift instructions do I want to look at?
Answer 1:
To split your incoming indices into two vectors of nibbles, you want the usual bit-shift and AND. SSE doesn't have 8-bit shifts, so you have to emulate one with a wider shift and an AND to mask away the bits that shifted into the top of each byte. (Unfortunately for this use-case, _mm_shuffle_epi8 does not ignore the high bits: if the top selector bit is set, it zeros that output element.)
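To make the need for the mask concrete, here is a small scalar sketch in portable C (no intrinsics). The names pshufb_byte and split_nibbles are hypothetical and only model the per-byte behaviour described above; they are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of one output byte of pshufb: bit 7 of the selector zeroes
   the result; otherwise bits 3:0 pick one of the 16 table bytes. */
uint8_t pshufb_byte(const uint8_t table[16], uint8_t selector)
{
    return (selector & 0x80) ? 0 : table[selector & 0x0f];
}

/* Hypothetical helper: split n index bytes into high and low nibbles, four
   bytes at a time, the way a 32-bit shift plus a byte mask does per dword. */
void split_nibbles(const uint8_t *in, uint8_t *hi, uint8_t *lo, size_t n)
{
    assert(n % 4 == 0);  /* sketch only: no tail handling */
    for (size_t i = 0; i < n; i += 4) {
        uint32_t lane;
        memcpy(&lane, in + i, 4);
        /* The wide shift moves a neighbouring byte's low nibble into the
           top 4 bits of each byte, so mask every byte down to 4 bits. */
        uint32_t hi_lane = (lane >> 4) & 0x0f0f0f0fu;
        uint32_t lo_lane = lane & 0x0f0f0f0fu;
        memcpy(hi + i, &hi_lane, 4);
        memcpy(lo + i, &lo_lane, 4);
    }
}
```

Without the mask, an index byte next to one whose low nibble is 8 or more would end up with its top bit set after the shift, and pshufb would zero that lane instead of looking it up.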
You definitely do not want to widen your incoming i8 vector to 16-bit elements; that would not be usable with _mm_shuffle_epi8.
AVX2 has vpermd: select dwords from a vector of 8x 32-bit elements. (It has only 3-bit indices, so it's not good for your use-case unless your nibbles are only 0..7.) AVX512BW has wider shuffles, including vpermi2w to index into a table formed by the concatenation of two vectors, or just vpermw to index words.
But for 128-bit vectors with just SSSE3, yeah, pshufb (_mm_shuffle_epi8) is the way to go. You'll need two separate vectors for high_table, one for the upper byte and one for the lower byte of each word entry, and another two vectors for the halves of low_table.
Use _mm_unpacklo_epi8 and _mm_unpackhi_epi8 to interleave the low 8 bytes of two vectors, or the high 8 bytes of two vectors. That will give you the 16-bit LUT results you want, with the upper half in each word coming from the high-half vector.
i.e. you're building a 16-bit LUT out of two 8-bit LUTs with this interleave. And you're repeating the process twice for two different LUTs.
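As a scalar illustration of that idea, the sketch below splits a 16-entry uint16_t table into two byte tables and rebuilds each word from two byte lookups. The split_lut type and both function names are hypothetical, made up for this example; it models what the shuffles and the interleave compute, not how fast they do it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container: one 16-bit LUT stored as two 8-bit LUTs. */
typedef struct {
    uint8_t lo[16];  /* low byte of each table entry  */
    uint8_t hi[16];  /* high byte of each table entry */
} split_lut;

split_lut split_table(const uint16_t table[16])
{
    split_lut s;
    for (int k = 0; k < 16; k++) {
        s.lo[k] = (uint8_t)(table[k] & 0xff);
        s.hi[k] = (uint8_t)(table[k] >> 8);
    }
    return s;
}

/* Look up 16 nibble indexes. out[0..7] is what the unpacklo interleave
   would produce, out[8..15] what the unpackhi interleave would produce. */
void lookup16(const split_lut *s, const uint8_t idx[16], uint16_t out[16])
{
    uint8_t lo_bytes[16], hi_bytes[16];
    for (int i = 0; i < 16; i++) {
        assert(idx[i] < 16);           /* indexes must already be nibbles */
        lo_bytes[i] = s->lo[idx[i]];   /* models pshufb on the low-byte LUT  */
        hi_bytes[i] = s->hi[idx[i]];   /* models pshufb on the high-byte LUT */
    }
    for (int i = 0; i < 16; i++) {
        /* Interleave: the low-byte result becomes the low half of the word. */
        out[i] = (uint16_t)(lo_bytes[i] | (hi_bytes[i] << 8));
    }
}
```

For any nibble index vector, out[i] should equal table[idx[i]], which is a handy property to check a SIMD version against.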
The code would look something like this:
// UNTESTED, haven't tried even compiling this.
// produces 2 output vectors, you might want to just put this in a loop instead of making a helper function for 1 vector.
// so I'll omit actually returning them.
#include <stdint.h>
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8

void foo(__m128i indices)
{
    // these optimize away, only used at compile time for the vector initializers
    static const uint16_t high_table[16] = {...};
    static const uint16_t low_table[16] = {...};

    // each LUT needs a separate vector of high-byte and low-byte parts
    // don't use SIMD intrinsics to load from the uint16_t tables and deinterleave at runtime,
    // just get the same 16x 2 x 2 bytes of data into vector constants at compile time.
    __m128i high_LUT_lobyte = _mm_setr_epi8(high_table[0]&0xff, high_table[1]&0xff, high_table[2]&0xff, ... );
    __m128i high_LUT_hibyte = _mm_setr_epi8(high_table[0]>>8, high_table[1]>>8, high_table[2]>>8, ... );
    __m128i low_LUT_lobyte = _mm_setr_epi8(low_table[0]&0xff, low_table[1]&0xff, low_table[2]&0xff, ... );
    __m128i low_LUT_hibyte = _mm_setr_epi8(low_table[0]>>8, low_table[1]>>8, low_table[2]>>8, ... );

    // split the input indexes: emulate byte shift with wider shift + AND
    __m128i lo_idx = _mm_and_si128(indices, _mm_set1_epi8(0x0f));
    __m128i hi_idx = _mm_and_si128(_mm_srli_epi32(indices, 4), _mm_set1_epi8(0x0f));

    // four byte-wide LUT lookups: {low,high} table x {low,high} byte
    __m128i lolo = _mm_shuffle_epi8(low_LUT_lobyte, lo_idx);
    __m128i lohi = _mm_shuffle_epi8(low_LUT_hibyte, lo_idx);
    __m128i hilo = _mm_shuffle_epi8(high_LUT_lobyte, hi_idx);
    __m128i hihi = _mm_shuffle_epi8(high_LUT_hibyte, hi_idx);

    // interleave results of LUT lookups into vectors of 16-bit elements
    __m128i low_result_first = _mm_unpacklo_epi8(lolo, lohi);
    __m128i low_result_second = _mm_unpackhi_epi8(lolo, lohi);
    __m128i high_result_first = _mm_unpacklo_epi8(hilo, hihi);
    __m128i high_result_second = _mm_unpackhi_epi8(hilo, hihi);

    // first 8x 16-bit high_table[i8>>4] & low_table[i8&15] results
    __m128i and_first = _mm_and_si128(low_result_first, high_result_first);
    // second 8x 16-bit high_table[i8>>4] & low_table[i8&15] results
    __m128i and_second = _mm_and_si128(low_result_second, high_result_second);

    // TODO: do something with the results.
}
You could AND before interleaving, high halves against high halves and low against low. That might be somewhat better for instruction-level parallelism, letting execution of the ANDs overlap with the shuffles. (Intel Haswell through Skylake has only 1/clock throughput for shuffles.)
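A minimal sketch of that AND-first variant, written as a loop over an index array, might look like the following. Several things here are assumptions rather than part of the answer above: the function name lut_and_epi16, passing the tables as parameters and splitting them with scalar code once per call (the answer recommends compile-time vector constants instead), the requirement that n is a multiple of 16, and the GCC/Clang target attribute used so the file builds without -mssse3. It has only been reasoned through, not benchmarked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

#if defined(__GNUC__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))  /* GCC/Clang only */
#else
#define TARGET_SSSE3  /* other compilers: enable SSSE3 via build flags */
#endif

/* Hypothetical helper:
   out[i] = high_table[idx[i] >> 4] & low_table[idx[i] & 15] */
TARGET_SSSE3
void lut_and_epi16(const uint16_t high_table[16], const uint16_t low_table[16],
                   const uint8_t *idx, uint16_t *out, size_t n)
{
    assert(n % 16 == 0);  /* sketch only: no tail handling */

    /* Split each 16-bit table into low-byte and high-byte LUTs (done at
       run time here for brevity; hoist or precompute in real code). */
    uint8_t hl[16], hh[16], ll[16], lh[16];
    for (int k = 0; k < 16; k++) {
        hl[k] = (uint8_t)(high_table[k] & 0xff);
        hh[k] = (uint8_t)(high_table[k] >> 8);
        ll[k] = (uint8_t)(low_table[k] & 0xff);
        lh[k] = (uint8_t)(low_table[k] >> 8);
    }
    const __m128i high_LUT_lobyte = _mm_loadu_si128((const __m128i *)hl);
    const __m128i high_LUT_hibyte = _mm_loadu_si128((const __m128i *)hh);
    const __m128i low_LUT_lobyte  = _mm_loadu_si128((const __m128i *)ll);
    const __m128i low_LUT_hibyte  = _mm_loadu_si128((const __m128i *)lh);
    const __m128i nibble_mask = _mm_set1_epi8(0x0f);

    for (size_t i = 0; i < n; i += 16) {
        __m128i indices = _mm_loadu_si128((const __m128i *)(idx + i));
        __m128i lo_idx = _mm_and_si128(indices, nibble_mask);
        __m128i hi_idx = _mm_and_si128(_mm_srli_epi32(indices, 4), nibble_mask);

        /* AND matching byte halves first, then interleave once. */
        __m128i and_lo = _mm_and_si128(_mm_shuffle_epi8(low_LUT_lobyte, lo_idx),
                                       _mm_shuffle_epi8(high_LUT_lobyte, hi_idx));
        __m128i and_hi = _mm_and_si128(_mm_shuffle_epi8(low_LUT_hibyte, lo_idx),
                                       _mm_shuffle_epi8(high_LUT_hibyte, hi_idx));

        _mm_storeu_si128((__m128i *)(out + i),     _mm_unpacklo_epi8(and_lo, and_hi));
        _mm_storeu_si128((__m128i *)(out + i + 8), _mm_unpackhi_epi8(and_lo, and_hi));
    }
}
```

Compared with the listing above, this ordering needs two unpack instructions per 16 indexes instead of four, and the unpacks are themselves shuffles, so it also reduces pressure on the shuffle port.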
Choosing variable names is a struggle with stuff like this. Some people just give up and use non-meaningful names for some intermediate steps.
Source: https://stackoverflow.com/questions/61436326/how-do-i-vectorize-data-i160-to-15