Let's say I have an array
k = [1 2 0 0 5 4 0]
I can compute a mask as follows
m = k > 0 = [1 1 0 0 1 1 0]
Using only the mask m and
So you need to figure out whether the extra parallelism is worth the shifting/shuffling overhead for such a simple task. The plain scalar loop is:
for (int inIdx = 0, outIdx = 0; inIdx < inLength; inIdx++) {
    if (mask[inIdx] == 1) {
        out[outIdx] = in[inIdx];
        outIdx++;
    }
}
If you want to go the parallel SIMD route, your best bet is a switch statement with a case for every possible combination of the next 4 bits of the mask. Why not 8? Because the PSHUFD instruction shuffles within a 128-bit XMM register (four 32-bit elements), not across a 256-bit YMM register.
So you make 16 cases:
So every case would need only a minimal amount of processing (1 to 2 SIMD instructions and 1 output pointer addition). The loop surrounding the switch would handle the constant input pointer increment (by 4) and the MOVDQA that loads the input.
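The 16-case idea could be sketched roughly as follows, using SSE2 intrinsics in C. This is only an illustration under some assumptions, not a tuned implementation: the function name left_pack_sse2 and the PACK_CASE macro are made up for this example, the elements are assumed to be 32-bit ints, the 4 mask bits are built from the same int mask array as in the scalar loop, and unaligned loads (MOVDQU via _mm_loadu_si128) are used in place of MOVDQA so the input does not have to be 16-byte aligned. Each case always stores all four lanes and then advances the output index only by the number of kept elements, so the leftover lanes get overwritten by the next store.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_shuffle_epi32, _mm_storeu_si128 */
#include <stddef.h>

/* One switch case: shuffle v with a compile-time immediate (PSHUFD),
   store all 4 lanes, then advance the output index by n kept elements. */
#define PACK_CASE(bits, imm, n)                                  \
    case bits:                                                   \
        _mm_storeu_si128((__m128i *)(out + outIdx),              \
                         _mm_shuffle_epi32(v, imm));             \
        outIdx += (n);                                           \
        break;

/* Hypothetical helper: copies in[i] to the front of out wherever mask[i] == 1.
   Returns the number of elements written. out must hold inLength ints. */
size_t left_pack_sse2(const int *in, const int *mask, size_t inLength, int *out)
{
    size_t inIdx = 0, outIdx = 0;

    for (; inIdx + 4 <= inLength; inIdx += 4) {
        /* Unaligned load; _mm_load_si128 (MOVDQA) works if in is 16-byte aligned. */
        __m128i v = _mm_loadu_si128((const __m128i *)(in + inIdx));

        /* Bit j is set when element j of this group of 4 is kept. */
        int bits = (mask[inIdx] == 1)
                 | ((mask[inIdx + 1] == 1) << 1)
                 | ((mask[inIdx + 2] == 1) << 2)
                 | ((mask[inIdx + 3] == 1) << 3);

        /* _MM_SHUFFLE(z, y, x, w): result lanes 0..3 = source lanes w, x, y, z. */
        switch (bits) {
        case 0: break;                                 /* nothing kept */
        PACK_CASE(1,  _MM_SHUFFLE(3, 2, 1, 0), 1)      /* keep 0       */
        PACK_CASE(2,  _MM_SHUFFLE(3, 2, 1, 1), 1)      /* keep 1       */
        PACK_CASE(3,  _MM_SHUFFLE(3, 2, 1, 0), 2)      /* keep 0,1     */
        PACK_CASE(4,  _MM_SHUFFLE(3, 2, 1, 2), 1)      /* keep 2       */
        PACK_CASE(5,  _MM_SHUFFLE(3, 2, 2, 0), 2)      /* keep 0,2     */
        PACK_CASE(6,  _MM_SHUFFLE(3, 2, 2, 1), 2)      /* keep 1,2     */
        PACK_CASE(7,  _MM_SHUFFLE(3, 2, 1, 0), 3)      /* keep 0,1,2   */
        PACK_CASE(8,  _MM_SHUFFLE(3, 2, 1, 3), 1)      /* keep 3       */
        PACK_CASE(9,  _MM_SHUFFLE(3, 2, 3, 0), 2)      /* keep 0,3     */
        PACK_CASE(10, _MM_SHUFFLE(3, 2, 3, 1), 2)      /* keep 1,3     */
        PACK_CASE(11, _MM_SHUFFLE(3, 3, 1, 0), 3)      /* keep 0,1,3   */
        PACK_CASE(12, _MM_SHUFFLE(3, 2, 3, 2), 2)      /* keep 2,3     */
        PACK_CASE(13, _MM_SHUFFLE(3, 3, 2, 0), 3)      /* keep 0,2,3   */
        PACK_CASE(14, _MM_SHUFFLE(3, 3, 2, 1), 3)      /* keep 1,2,3   */
        PACK_CASE(15, _MM_SHUFFLE(3, 2, 1, 0), 4)      /* keep all     */
        }
    }

    /* Scalar tail for the last inLength % 4 elements. */
    for (; inIdx < inLength; inIdx++) {
        if (mask[inIdx] == 1) {
            out[outIdx++] = in[inIdx];
        }
    }
    return outIdx;
}
```

The cases whose kept elements are already at the front (1, 3, 7, 15) use the identity shuffle here just to keep the table uniform; in practice they could store v directly. Whether any of this beats the scalar loop is something you would have to measure on your own data.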