Keep only the 10 useful bits in 16-bit words
Question: I have __m256i vectors that contain 10-bit words inside 16-bit integers, so each vector holds 16 × 16-bit lanes carrying only 16 × 10 = 160 useful bits. What is the best/fastest way to extract only those 10 bits per lane and pack them into an output bitstream of 10-bit values?

Answer 1: Here's my attempt. I have not benchmarked it, but I think it should be pretty fast overall: there are not too many instructions, and all of them have 1 cycle of latency on modern processors. The stores are also efficient: 2 store instructions for 20 bytes of data.
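One way such a routine could look is sketched below. This is not necessarily the answerer's code: the function names `pack10_scalar` and `pack10_avx2` are hypothetical, and the bit order is an assumption, since the question does not specify it (value i is assumed to start at bit 10*i of the output, least significant bit first). The scalar version defines the expected layout; the AVX2 version is only compiled when the compiler targets AVX2 (for example with `-mavx2`) and writes exactly 20 bytes using a 16-byte store plus a 4-byte store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Scalar reference. Packs `count` values (a multiple of 4) into 5 bytes per
// group of 4. Assumed layout: value i occupies bits [10*i, 10*i + 9].
inline void pack10_scalar(const uint16_t* src, size_t count, uint8_t* dst) {
    assert(count % 4 == 0);
    for (size_t i = 0; i < count; i += 4) {
        uint64_t bits = 0;
        for (int k = 0; k < 4; ++k)
            bits |= static_cast<uint64_t>(src[i + k] & 0x3FF) << (10 * k);
        for (int b = 0; b < 5; ++b)
            *dst++ = static_cast<uint8_t>(bits >> (8 * b));
    }
}

#if defined(__AVX2__)
// AVX2 sketch: 16 x 16-bit lanes in, exactly 20 bytes written to dst.
inline void pack10_avx2(__m256i v, uint8_t* dst) {
    // Drop the 6 unused upper bits of every 16-bit lane.
    v = _mm256_and_si256(v, _mm256_set1_epi16(0x3FF));
    // Merge pairs of lanes: a + b * 1024 gives 20 bits per 32-bit lane.
    const __m256i mul = _mm256_set1_epi32(1 | (1024 << 16));
    __m256i r = _mm256_madd_epi16(v, mul);
    // Shift the even 32-bit lanes left by 12 so that each 64-bit lane holds
    // 40 contiguous bits at positions 12..51, then move them down to 0..39.
    r = _mm256_sllv_epi32(r, _mm256_set1_epi64x(12));
    r = _mm256_srli_epi64(r, 12);
    // Per 128-bit half, gather the 5 + 5 payload bytes. The low half keeps
    // them in bytes 0..9. The high half puts its first 6 payload bytes in
    // bytes 10..15 and its last 4 payload bytes in bytes 0..3.
    const __m256i perm = _mm256_setr_epi8(
        0, 1, 2, 3, 4, 8, 9, 10, 11, 12, -1, -1, -1, -1, -1, -1,
        9, 10, 11, 12, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 8);
    r = _mm256_shuffle_epi8(r, perm);
    const __m128i lo = _mm256_castsi256_si128(r);
    const __m128i hi = _mm256_extracti128_si256(r, 1);
    // First 16 output bytes: bytes 0..9 from lo, bytes 10..15 from hi.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst),
                     _mm_blend_epi16(lo, hi, 0xE0));
    // Last 4 output bytes sit in the low 32 bits of hi.
    const uint32_t tail = static_cast<uint32_t>(_mm_cvtsi128_si32(hi));
    std::memcpy(dst + 16, &tail, 4);
}
#endif
```

If the upper 6 bits of every lane are already known to be zero, the initial `_mm256_and_si256` can be dropped. Comparing the AVX2 output against `pack10_scalar` on random input is a cheap way to confirm that the shuffle mask matches the bit order you actually need.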