Optimizing Array Compaction

耶瑟儿~ · 2021-02-07 01:52

Let's say I have an array k = [1 2 0 0 5 4 0]

I can compute a mask as follows m = k > 0 = [1 1 0 0 1 1 0]

Using only the mask m and

5 Answers
  • 2021-02-07 02:13

    So you need to figure out whether the extra parallelism is worth the shifting/shuffling overhead for such a simple task.

    // Scalar compaction: copy each input element whose mask bit is set.
    for (int inIdx = 0, outIdx = 0; inIdx < inLength; inIdx++) {
        if (mask[inIdx] == 1) {
            out[outIdx] = in[inIdx];
            outIdx++;
        }
    }
    

    If you want to go the parallel SIMD route, your best bet is a switch/case over all possible permutations of the next 4 bits of the mask. Why not 8? Because the PSHUFD instruction can only shuffle within the 128-bit XMM registers, not the 256-bit YMM registers.

    So you make 16 Cases:

    • [1 1 1 1], [1 1 1 0], [1 1 0 0], [1 0 0 0], [0 0 0 0] don't need any special shift/shuffle: you just copy the input to the output with MOVDQU and increment the output pointer by 4, 3, 2, 1, 0 respectively.
    • [0 1 1 1], [0 0 1 1], [0 1 1 0], [0 0 0 1], [0 1 0 0], [0 0 1 0] just need a PSRLx (shift right logical); increment the output pointer by 3, 2, 2, 1, 1, 1 respectively.
    • [1 0 0 1], [1 0 1 0], [0 1 0 1], [1 0 1 1], [1 1 0 1] use PSHUFD to pack the input, then increment the output pointer by 2, 2, 2, 3, 3 respectively.

    So every case needs only a minimal amount of processing (1 to 2 SIMD instructions plus one output-pointer addition). The surrounding loop around the switch handles the constant input-pointer addition (by 4) and the MOVDQA load of the input.
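
    To make the idea concrete, here is a minimal sketch in C++ with SSE2 intrinsics, assuming 32-bit integer elements and that the leftmost entry in the bracket notation above is element 0 / bit 0 of the 4-bit mask. The function name compact4, the bit ordering, and the scalar fallback for the cases not written out are my own choices, not part of the answer; whole 16-byte stores are used, so the output buffer needs a few elements of slack past the compacted end.

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdint>

    // Compact one group of 4 int32 elements according to the low 4 bits of mask4.
    // Returns the advanced output pointer.
    static int32_t *compact4(const int32_t *in, int32_t *out, unsigned mask4)
    {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(in));

        switch (mask4 & 0xFu) {
        case 0xF:  // [1 1 1 1]: plain copy, advance output by 4
            _mm_storeu_si128(reinterpret_cast<__m128i *>(out), v);
            return out + 4;
        case 0x7:  // [1 1 1 0]: copy as-is, advance by 3 (the stale 4th lane is overwritten later)
            _mm_storeu_si128(reinterpret_cast<__m128i *>(out), v);
            return out + 3;
        case 0xE:  // [0 1 1 1]: shift right by one element (PSRLDQ), advance by 3
            _mm_storeu_si128(reinterpret_cast<__m128i *>(out), _mm_srli_si128(v, 4));
            return out + 3;
        case 0x5:  // [1 0 1 0]: PSHUFD packs lanes 0 and 2, advance by 2
            _mm_storeu_si128(reinterpret_cast<__m128i *>(out),
                             _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 2, 0)));
            return out + 2;
        default:   // the remaining 12 cases follow the same pattern; scalar fallback here
            for (int i = 0; i < 4; ++i)
                if (mask4 & (1u << i))
                    *out++ = in[i];
            return out;
        }
    }

    The returned pointer is the new output position, so the caller's loop only has to advance the input pointer by 4 and fetch the next 4 mask bits.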

  • 2021-02-07 02:16

    Reading the comments below the original question: in the actual problem the array contains 32-bit floating-point numbers and the mask is a (single?) 32-bit integer, so I don't see why shifts etc. should be used for compacting the array. A simple compacting algorithm (in C) would be something like this:

    float array[8];
    unsigned int mask = ...;
    int a = 0, b = 0;
    while (mask) {
      if (mask & 1) { array[a++] = array[b]; }
      b++;
      mask >>= 1;
    }
    /* Size of compacted array is 'a' */
    /* Optionally clear the rest: */
    while (a < 8) array[a++] = 0.0;
    

    Minor variations would be due to the bit order of the mask, but the only ALU operations needed are index-variable updates plus shifting and ANDing the mask. Because the original array is at least 256 bits wide, no common CPU can shift the whole array around bit-wise anyway.
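
    For concreteness, here is the question's example run through that loop as a self-contained C++ program; the mask-building step and the padding to 8 elements are my additions, not part of the answer.

    #include <cstdio>

    int main()
    {
        // The question's array, padded to 8 elements and stored as floats.
        float array[8] = {1, 2, 0, 0, 5, 4, 0, 0};

        // Build the bit mask: bit i is set iff element i is positive (0b00110011 here).
        unsigned int mask = 0;
        for (int i = 0; i < 8; ++i)
            if (array[i] > 0.0f) mask |= 1u << i;

        // The compaction loop from above, unchanged.
        int a = 0, b = 0;
        while (mask) {
            if (mask & 1) { array[a++] = array[b]; }
            b++;
            mask >>= 1;
        }
        while (a < 8) array[a++] = 0.0f;  // clear the rest

        for (float x : array) std::printf("%g ", x);  // prints: 1 2 5 4 0 0 0 0
        std::printf("\n");
        return 0;
    }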

  • 2021-02-07 02:16

    Assuming what you want is to store only the positive integers from the array in as few steps as possible, here is sample C++ code:

    int j = 0;
    const int arraysize = sizeof k / sizeof k[0];  // element count, not byte count
    int store[arraysize];
    for (int i = 0; i < arraysize; i++)
    {
        if (k[i] > 0)          // keep only the positive elements
        {
            store[j] = k[i];
            j++;               // j ends up as the compacted size
        }
    }
    

    Or you can use the elements of k[ ] directly if you don't want to use a for loop.
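
    If the goal is to compact k[ ] in place rather than copying into store[ ], the standard library already covers this; a short sketch using the question's values (my own addition, not the answer's code):

    #include <algorithm>
    #include <iterator>
    #include <cstdio>

    int main()
    {
        int k[] = {1, 2, 0, 0, 5, 4, 0};

        // Move the positive elements to the front, preserving their order;
        // 'tail' points one past the last kept element.
        int *tail = std::remove_if(std::begin(k), std::end(k),
                                   [](int x) { return x <= 0; });

        std::fill(tail, std::end(k), 0);  // optional: zero the leftover slots

        for (int x : k) std::printf("%d ", x);  // prints: 1 2 5 4 0 0 0
        std::printf("\n");
        return 0;
    }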

  • 2021-02-07 02:20

    The original code moves array elements only one step at a time. This may be improved: it is possible to group array elements and shift them by 2^k steps at once.

    The first part of this algorithm computes how many steps each element should be shifted. The second part moves the elements: first by one step, then by 2, then by 4, and so on. This works correctly and elements do not get intermixed, because after each shift there is enough free space to perform a shift twice as large.

    Matlab, code not tested:

    function out = compact( in )
        % Shift counts: s(i) = number of non-positive elements to the left of i
        m = in <= 0;
        s = zeros(1, size(in, 2));
        for i = 1:size(in, 2)-1
            m = [0 m(1:end-1)];
            s = s + m;
        end

        % Move elements left by 1, 2, 4, ... according to the binary digits of s
        d = in;
        shift = 1;
        for j = 1:ceil(log2(size(in, 2)))
            s1 = rem(s, 2);
            s = (s - s1) / 2;
            d = (d .* ~s1) + ([d(1+shift:end) zeros(1,shift)] .* [s1(1+shift:end) zeros(1,shift)]);
            shift = shift*2;
        end
        out = d;
    end
    

    The above algorithm's complexity is O(N * (1 shift + 1 add) + log(N) * (1 rem + 2 add + 3 mul + 2 shift)).
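
    Since the Matlab above is untested, here is a rough C++ translation of the same idea (my own translation, not the poster's code; the shift-count loop is replaced by an equivalent sequential prefix count):

    #include <vector>

    // Compact the positive elements of 'in' to the front using ceil(log2(N))
    // passes, each moving marked elements left by a power-of-two distance.
    std::vector<int> compact(const std::vector<int>& in)
    {
        const int n = static_cast<int>(in.size());

        // s[i] = number of non-positive elements strictly to the left of i,
        // i.e. the total distance element i eventually has to move left.
        std::vector<int> s(n, 0);
        for (int i = 1; i < n; ++i)
            s[i] = s[i - 1] + (in[i - 1] <= 0 ? 1 : 0);

        std::vector<int> d(in);
        for (int shift = 1; shift < n; shift *= 2) {
            std::vector<int> nd(n, 0);
            for (int i = 0; i < n; ++i) {
                if ((s[i] & 1) == 0)          // element stays put this round
                    nd[i] = d[i];
                if (i + shift < n && (s[i + shift] & 1) == 1)
                    nd[i] += d[i + shift];    // element arrives from 'shift' places to the right
            }
            for (int i = 0; i < n; ++i)
                s[i] >>= 1;                   // consume one binary digit of the shift count
            d.swap(nd);
        }
        return d;  // e.g. {1,2,0,0,5,4,0} -> {1,2,5,4,0,0,0}
    }

    Each of the ceil(log2(N)) passes touches all N elements with nothing but element-wise selects and a fixed-distance shift, which is the vectorizable structure the Matlab formulation is after.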

  • 2021-02-07 02:26

    There is not much to optimize in the original pseudo-code. I see several small improvements:

    • the loop may perform one fewer iteration (i.e. size-1),
    • if 'use' is zero, you may break out of the loop early,
    • use = (m == 0) & (ml == 1) can probably be simplified to use = ~m & ml,
    • if ~ is counted as a separate operation, it would be better to use the inverted form: use = m | ~ml, d = ~use .* dl + use .* d, use_r = [1 use(1:end-1)], d = d .* use_r

    But it is possible to invent better algorithms, and the choice of algorithm depends on which CPU resources are used:

    • Load-store unit, i.e. applying the algorithm directly to memory words. Nothing can be done here until chip makers add a highly parallel scatter instruction to their instruction sets.
    • SSE registers, i.e. algorithms working on the entire 16 bytes of a register. Algorithms like the proposed pseudo-code cannot help here, because we already have various shuffle/permute instructions that do the job better. Using compare instructions together with PMOVMSKB, grouping the result 4 bits at a time, and applying shuffle instructions under a switch/case (as described by LastCoder) is the best we can do.
    • SSE/AVX registers with the latest instruction sets allow a better approach: we can use the result of PMOVMSKB directly, transforming it into a control register for something like PSHUFB (see the sketch after the code below).
    • Integer registers, i.e. GPR registers, or working simultaneously on several DWORD/QWORD parts of SSE/AVX registers (which allows several independent compactions to be performed at once). The proposed pseudo-code applied to integer registers allows compacting binary subsets of any length (from 2 to 20 bits). Here is my algorithm, which is likely to perform better.

    C++, 64 bit, subset width = 8:

    typedef unsigned long long ull;
    const ull h = 0x8080808080808080;
    const ull l = 0x0101010101010101;
    const ull end = 0xffffffffffffffff;
    
    // uncompacted bytes
    ull x = 0x0100802300887700;
    
    // set hi bit for zero bytes (see D.Knuth, volume 4)
    ull m = h & ~(x | ((x|h) - l));
    
    // bitmask for nonzero bytes
    m = ~(m | (m - (m>>7)));
    
    // tail zero bytes need no special treatment
    m |= (m - 1);
    
    while (m != end)
    {
      ull tailm = m ^ (m + 1); // bytes to be processed
      ull tailx = x & tailm; // get the bytes
      tailm |= (tailm << 8); // shift 1 byte at a time
      m |= tailm; // all processed bytes are masked
      x = (x ^ tailx) | (tailx << 8); // actual byte shift
    }
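
    The PMOVMSKB/MOVMSKPS + PSHUFB idea from the third bullet above can be sketched as follows (C++ with SSSE3 intrinsics; the lookup-table construction, the names, and the float/greater-than-zero convention are my own illustration, not part of this answer):

    #include <immintrin.h>
    #include <cstdint>

    // One PSHUFB control per 4-bit mask value; call build_lut() once before compact4().
    static __m128i g_lut[16];

    static void build_lut()
    {
        for (int mask = 0; mask < 16; ++mask) {
            alignas(16) uint8_t ctl[16];
            int out = 0;
            for (int lane = 0; lane < 4; ++lane) {
                if (mask & (1 << lane)) {          // lane is kept: copy its 4 bytes
                    for (int b = 0; b < 4; ++b)
                        ctl[out * 4 + b] = static_cast<uint8_t>(lane * 4 + b);
                    ++out;
                }
            }
            for (int b = out * 4; b < 16; ++b)     // 0x80 makes PSHUFB emit zero bytes
                ctl[b] = 0x80;
            g_lut[mask] = _mm_load_si128(reinterpret_cast<const __m128i *>(ctl));
        }
    }

    // Compact 4 floats: lanes with value > 0 are packed to the front, the rest zeroed.
    // Returns the number of kept lanes.
    static int compact4(const float *in, float *out)
    {
        static const int popcnt4[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};
        __m128  v = _mm_loadu_ps(in);
        int  mask = _mm_movemask_ps(_mm_cmpgt_ps(v, _mm_setzero_ps()));
        __m128i r = _mm_shuffle_epi8(_mm_castps_si128(v), g_lut[mask]);
        _mm_storeu_ps(out, _mm_castsi128_ps(r));
        return popcnt4[mask];
    }

    With AVX2 the lane permutation can be done with VPERMPS, and AVX-512 has VCOMPRESSPS which performs the compaction directly; the overall structure stays the same.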
    