Optimizing Array Compaction

耶瑟儿~ 2021-02-07 01:52

Let's say I have an array k = [1 2 0 0 5 4 0].

I can compute a mask as follows: m = k > 0 = [1 1 0 0 1 1 0].

Using only the mask m and

5 answers
  •  野的像风
    2021-02-07 02:13

    You need to figure out whether the extra parallelism is worth the shifting/shuffling overhead for such a simple task.

    /* Scalar baseline: copy in[i] to out wherever mask[i] is set. */
    for (int inIdx = 0, outIdx = 0; inIdx < inLength; inIdx++) {
        if (mask[inIdx] == 1) {
            out[outIdx] = in[inIdx];
            outIdx++;
        }
    }
    

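    If you want to go the parallel SIMD route, your best bet is a switch statement with one case for each possible combination of the next 4 bits of the mask. Why not 8? Because PSHUFD only shuffles four 32-bit elements within a 128-bit XMM register; its 256-bit form on YMM registers (VPSHUFD) shuffles each 128-bit half separately and cannot move elements between the halves.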

    So you make 16 cases (mask bits are written lowest element first):

    • [1 1 1 1], [1 1 1 0], [1 1 0 0], [1 0 0 0], [0 0 0 0] need no shift or shuffle: store the input to the output with MOVDQU and increment the output pointer by 4, 3, 2, 1, 0 respectively. The store always writes all four elements; those past the kept count are overwritten by the next store.
    • [0 1 1 1], [0 0 1 1], [0 1 1 0], [0 0 0 1], [0 1 0 0], [0 0 1 0] only need PSRLDQ (byte-wise logical shift right of the whole register) by 4, 8, 4, 12, 4, 8 bytes, then increment the output pointer by 3, 2, 2, 1, 1, 1 respectively.
    • [1 0 0 1], [1 0 1 0], [0 1 0 1], [1 0 1 1], [1 1 0 1] use PSHUFD to pack the input, then increment the output pointer by 2, 2, 2, 3, 3 respectively.

    Every case then needs only minimal processing: 1 to 2 SIMD instructions and one output pointer addition. The loop surrounding the switch advances the input pointer by a constant 4 elements and loads the input with MOVDQA, which requires the input to be 16-byte aligned.
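
The 16-case dispatch described above can be sketched in C with SSE2 intrinsics. This is only an illustration under assumptions the answer does not state: 32-bit integer elements, a mask stored as one byte per element, unaligned loads (`_mm_loadu_si128`) in place of MOVDQA, and the hypothetical function names `pack4` and `compact`. Bit i of `mask4` stands for element i of the current group of four. On targets without SSE2 the sketch falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define COMPACT_USE_SSE2 1
#endif

#ifdef COMPACT_USE_SSE2
/* Move the lanes of v selected by mask4 (bit i set = keep lane i) into the
   low lanes. Lanes above the kept count hold leftover data. */
static __m128i pack4(__m128i v, unsigned mask4)
{
    switch (mask4) {
    /* Shift cases: kept lanes are contiguous but do not start at lane 0. */
    case 0x2: case 0x6: case 0xE:            /* [0 1 0 0] [0 1 1 0] [0 1 1 1] */
        return _mm_srli_si128(v, 4);
    case 0x4: case 0xC:                      /* [0 0 1 0] [0 0 1 1] */
        return _mm_srli_si128(v, 8);
    case 0x8:                                /* [0 0 0 1] */
        return _mm_srli_si128(v, 12);
    /* Shuffle cases: there is a gap between kept lanes. */
    case 0x9:                                /* [1 0 0 1] */
        return _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 0));
    case 0x5: case 0xD:                      /* [1 0 1 0] [1 0 1 1] */
        return _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 2, 0));
    case 0xA:                                /* [0 1 0 1] */
        return _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 1));
    case 0xB:                                /* [1 1 0 1] */
        return _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 1, 0));
    /* Prefix cases [0 0 0 0] [1 0 0 0] [1 1 0 0] [1 1 1 0] [1 1 1 1]:
       the kept lanes are already packed. */
    default:
        return v;
    }
}
#endif

/* Copies in[i] to out for every i with mask[i] != 0 and returns the number
   of elements kept. out must have room for n elements. */
size_t compact(const int32_t *in, const uint8_t *mask, size_t n, int32_t *out)
{
    size_t i = 0, o = 0;
#ifdef COMPACT_USE_SSE2
    /* Number of set bits for each 4-bit mask value. */
    static const uint8_t kept[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                     1, 2, 2, 3, 2, 3, 3, 4};
    for (; i + 4 <= n; i += 4) {
        unsigned mask4 = (unsigned)(mask[i] != 0)
                       | (unsigned)(mask[i + 1] != 0) << 1
                       | (unsigned)(mask[i + 2] != 0) << 2
                       | (unsigned)(mask[i + 3] != 0) << 3;
        __m128i v = _mm_loadu_si128((const __m128i *)(in + i));
        /* Full 16-byte store. Since o <= i, it stays inside out[0..n). Lanes
           past the kept count are overwritten later or lie beyond the
           returned length. */
        _mm_storeu_si128((__m128i *)(out + o), pack4(v, mask4));
        o += kept[mask4];
    }
#endif
    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++) {
        if (mask[i] != 0) {
            out[o++] = in[i];
        }
    }
    return o;
}
```

For the question's k = [1 2 0 0 5 4 0] with m = [1 1 0 0 1 1 0], `compact` returns 4 and the first four output elements are 1 2 5 4. Whether this beats the scalar loop depends on how well the branch predictor handles the switch, so it is worth measuring on real masks before adopting it.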
