发表新帖

发表新帖

Optimizing Array Compaction

前端未结

关注

 5  940

耶瑟儿～ 2021-02-07 01:52

Let\'s say I have an array k = [1 2 0 0 5 4 0]

I can compute a mask as follows m = k > 0 = [1 1 0 0 1 1 0]

Using only the mask m and

5条回答

野的像风 (楼主)

2021-02-07 02:13
So you need to figure out if the extra parallelism, shifting/shuffling overhead is worth it for such a simple task.
```
for(int inIdx = 0, outIdx = 0; inIdx < inLength; inIdx++) {
 if(mask[inIdx] == 1) {
  out[outIdx] = in[inIdx];
  outIdx++;
 }
}
```
If you want to go the parallel SIMD route your best bet is a SWITCH CASE with all of the possible permutations of the next 4 bits of the mask. Why not 8? because the PSHUFD instruction can only shuffle on XMMX m128 not YMMX m256.

So you make 16 Cases:
- [1 1 1 1], [1 1 1 0], [1 1 0 0], [1 0 0 0], [0 0 0 0] don't need any special shift/shuffle you just copy the input to the output MOVDQU and increment the output pointer by 4, 3, 2, 1, 0 respectively.
- [0 1 1 1], [0 0 1 1], [0 1 1 0], [0 0 0 1], [0 1 0 0], [0 0 1 0] you just need to use PSRLx (shift right logical) and increment the output pointer by 3, 2, 2, 1, 1, 1 respectively
- [1 0 0 1], [1 0 1 0], [0 1 0 1], [1 0 1 1], [1 1 0 1] you use the PSHUFD to pack your input then increment your output pointer by 2, 2, 2, 3, 3 respectively.
So every case would be a minimal amount of processing (1 to 2 SIMD instructions and 1 output pointer addition). The surrounding loop of the case statements would handle the constant input pointer addition (by 4) and the MOVDQA to load the input.
0 讨论(0)

查看其它5个回答
发布评论:

提交评论
- 加载中...

热议问题