I have a sparse array a (mostly zeroes):

unsigned char a[1000000];

and I would like to create an array b of indices to the nonzero elements of a.
If you expect the number of nonzero elements to be very low (i.e. much less than 1%), then you can simply check each 16-byte chunk for being nonzero:
int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(reg, _mm_setzero_si128()));
if (mask != 65535) {
    //handle the zero bits of mask (the nonzero elements) with scalar code
}
If the percentage of good elements is sufficiently small, the cost of the mispredicted branches and of the slow scalar code inside the 'if' would be negligible.
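A minimal untested sketch of this idea as a full loop (the function name is illustrative, and it assumes the array length is a multiple of 16):

#include <emmintrin.h>  //SSE2

//returns the number of indices written to b; n must be a multiple of 16
int gather_indices_sparse(const unsigned char *a, int n, int *b) {
    int cnt = 0;
    for (int i = 0; i < n; i += 16) {
        __m128i reg = _mm_loadu_si128((const __m128i*)&a[i]);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(reg, _mm_setzero_si128()));
        if (mask != 65535) {                  //at least one nonzero byte in this chunk
            for (int j = 0; j < 16; j++)      //rare slow path: scalar scan of the chunk
                if (!(mask & (1 << j)))       //zero bit of mask => nonzero element
                    b[cnt++] = i + j;
        }
    }
    return cnt;
}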
As for a good general solution, first consider an SSE implementation of stream compaction. It removes all zero elements from a byte array (idea taken from here):
__m128i shuf[65536]; //must be precomputed
char cnt[65536];     //must be precomputed

int compress(const char *src, int len, char *dst) {
    char *ptr = dst;
    for (int i = 0; i < len; i += 16) {
        __m128i reg = _mm_load_si128((__m128i*)&src[i]);
        __m128i zeroMask = _mm_cmpeq_epi8(reg, _mm_setzero_si128());
        int mask = _mm_movemask_epi8(zeroMask);
        __m128i compressed = _mm_shuffle_epi8(reg, shuf[mask]);
        _mm_storeu_si128((__m128i*)ptr, compressed);
        ptr += cnt[mask]; //alternative: ptr += 16 - _mm_popcnt_u32(mask);
    }
    return ptr - dst;
}
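One possible (untested) way to fill shuf and cnt, inferred purely from how compress uses the tables above:

//fill shuf so that shuf[mask] gathers the nonzero bytes to the front,
//and cnt[mask] holds their count
void precompute_compress_luts() {
    for (int mask = 0; mask < 65536; mask++) {
        char perm[16];
        int k = 0;
        for (int j = 0; j < 16; j++)
            if (!(mask & (1 << j)))        //zero bit of mask => nonzero source byte
                perm[k++] = (char)j;       //gather its position to the front
        cnt[mask] = (char)k;               //number of nonzero bytes in the chunk
        while (k < 16)
            perm[k++] = (char)0x80;        //high bit set => _mm_shuffle_epi8 writes zero
        shuf[mask] = _mm_loadu_si128((const __m128i*)perm);
    }
}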
As you can see, _mm_shuffle_epi8 plus a lookup table can do wonders. I don't know of any other way to vectorize structurally complex code like stream compaction.
Now the only remaining problem with your request is that you want to get indices. Each index must be stored as a 4-byte value, so a chunk of 16 input bytes may produce up to 64 bytes of output, which does not fit into a single SSE register.
One way to handle this is to honestly unpack the output to 64 bytes. So you replace reg with the constant (0,1,2,3,...,15) in the code, then unpack the SSE register into 4 registers, and add a register holding four copies of i. This takes many more instructions: 6 unpacks, 4 adds, and 3 additional stores (one store is already there). To me, that is a huge overhead, especially if you expect less than 25% nonzero elements.
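For reference, an untested sketch of how the loop body of compress would change under this scheme (it reuses the shuf and cnt tables, and assumes dst and ptr become int pointers):

//inside the loop of compress, with char *ptr replaced by int *ptr:
__m128i zero  = _mm_setzero_si128();
__m128i ids8  = _mm_shuffle_epi8(_mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), shuf[mask]);
__m128i lo16  = _mm_unpacklo_epi8(ids8, zero);     //bytes 0..7  -> 16-bit lanes
__m128i hi16  = _mm_unpackhi_epi8(ids8, zero);     //bytes 8..15 -> 16-bit lanes
__m128i baseI = _mm_set1_epi32(i);                 //chunk offset i in all four lanes
_mm_storeu_si128((__m128i*)(ptr + 0),  _mm_add_epi32(_mm_unpacklo_epi16(lo16, zero), baseI));
_mm_storeu_si128((__m128i*)(ptr + 4),  _mm_add_epi32(_mm_unpackhi_epi16(lo16, zero), baseI));
_mm_storeu_si128((__m128i*)(ptr + 8),  _mm_add_epi32(_mm_unpacklo_epi16(hi16, zero), baseI));
_mm_storeu_si128((__m128i*)(ptr + 12), _mm_add_epi32(_mm_unpackhi_epi16(hi16, zero), baseI));
ptr += cnt[mask];                                  //advance by the number of indices produced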
Alternatively, you can limit the number of nonzero bytes processed per loop iteration to 4, so that one register is always enough for the output. Here is the sample code:
__m128i shufMask[65536]; //must be precomputed
char srcMove[65536];     //must be precomputed
char dstMove[65536];     //must be precomputed

int compress_ids(const char *src, int len, int *dst) {
    const char *ptrSrc = src;
    int *ptrDst = dst;
    __m128i offsets = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m128i base = _mm_setzero_si128();  //current source offset, broadcast to all four lanes
    while (ptrSrc < src + len) {
        __m128i reg = _mm_loadu_si128((__m128i*)ptrSrc);
        __m128i zeroMask = _mm_cmpeq_epi8(reg, _mm_setzero_si128());
        int mask = _mm_movemask_epi8(zeroMask);
        __m128i ids8 = _mm_shuffle_epi8(offsets, shufMask[mask]); //chunk-local positions of up to 4 nonzero bytes
        __m128i ids32 = _mm_unpacklo_epi16(_mm_unpacklo_epi8(ids8, _mm_setzero_si128()), _mm_setzero_si128());
        ids32 = _mm_add_epi32(ids32, base);
        _mm_storeu_si128((__m128i*)ptrDst, ids32);
        ptrDst += dstMove[mask]; //alternative: ptrDst += min(16 - _mm_popcnt_u32(mask), 4);
        ptrSrc += srcMove[mask]; //no alternative without LUT
        base = _mm_add_epi32(base, _mm_set1_epi32(srcMove[mask])); //base must advance together with ptrSrc
    }
    return ptrDst - dst;
}
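Again, one possible (untested) way to fill the three tables, inferred purely from how compress_ids uses them:

//shufMask[mask] gathers the positions of the first (up to 4) nonzero bytes,
//dstMove[mask] counts them, and srcMove[mask] says how far the source advances
void precompute_ids_luts() {
    for (int mask = 0; mask < 65536; mask++) {
        char perm[16];
        int taken = 0, consumed = 16;      //by default the whole chunk is consumed
        for (int j = 0; j < 16 && taken < 4; j++)
            if (!(mask & (1 << j))) {      //nonzero source byte
                perm[taken++] = (char)j;
                if (taken == 4)
                    consumed = j + 1;      //stop right after the 4th nonzero byte
            }
        dstMove[mask] = (char)taken;       //indices emitted this iteration (<= 4)
        srcMove[mask] = (char)consumed;    //source bytes consumed this iteration
        for (int k = taken; k < 16; k++)
            perm[k] = (char)0x80;          //zero the lanes that are not kept
        shufMask[mask] = _mm_loadu_si128((const __m128i*)perm);
    }
}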
One drawback of this approach is that each loop iteration now cannot start until the line ptrSrc += srcMove[mask]; has executed on the previous iteration, since the next load address depends on it. So the critical path has increased dramatically. Hardware hyperthreading, or its manual emulation (interleaving two independent streams in one loop), can hide this penalty.
So, as you can see, there are many variations of this basic idea, all of which solve your problem with different degrees of efficiency. You can also reduce the size of the LUTs if you don't like them (again, at the cost of lower throughput).
This approach cannot be fully extended to wider registers (i.e. AVX2 and AVX-512), but you can try to combine the instructions of several consecutive iterations into single AVX2 or AVX-512 instructions, thus slightly increasing throughput.
Note: I didn't test any of the code (correctly precomputing the LUTs requires noticeable effort).