How can I count the occurrences of a byte in an array using SIMD?

Question

Given the following input bytes:

var vBytes = new Vector<byte>(new byte[] {72, 101, 55, 108, 108, 111, 55, 87, 111, 114, 108, 55, 100, 55, 55, 20});

And the given mask:

var mask = new Vector<byte>(55);

How can I find the count of byte 55 in the input array?

I have tried xoring the vBytes with the mask:

var xored = Vector.Xor(mask, vBytes);

which gives:

<127, 82, 0, 91, 91, 88, 0, 96, 88, 69, 91, 0, 83, 0, 0, 35>

But I don't know how to get the count from that.

For the sake of simplicity, let's assume that the input length is always equal to Vector<byte>.Count.

Answer 1:

Thanks to Marc Gravell for his tip, the following works:

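// Lanes equal to the mask become 0xFF; all other lanes become 0.
var areEqual = Vector.Equals(vBytes, mask);
// Negating 0xFF (which is -1 as a signed byte) turns each match into 1.
var negation = Vector.Negate(areEqual);
// The dot product with an all-ones vector sums the lanes; the result is a byte.
var count = Vector.Dot(negation, Vector<byte>.One);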

Marc has a blog post with more info on the subject.

Answer 2:

In asm, you want pcmpeqb to produce a vector of 0 or 0xFF. Treated as signed integers, that's 0/-1.

Then use the compare result as integer values with psubb to add 0 / 1 to the counter for that element. (Subtract -1 = add +1)

A byte counter can hold at most 255, so it would wrap on the 256th matching iteration. Sometime before that, use psadbw against _mm_setzero_si128() to horizontally sum those unsigned bytes (without overflow) into 64-bit integers (one 64-bit integer per group of 8 bytes). Then paddq to accumulate 64-bit totals.

Accumulating before you overflow can be done with a nested loop, or just at the end of a regular unrolled loop. psadbw is fast (because it's a key building block for video encoding motion-search), so it's not bad to just accumulate every 4 compares, or even every 1 and skip the psubb.
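As a rough sketch in C with SSE2 intrinsics (not C#), the nested-loop version of that idea might look like the following. The function name count_byte_sse2 and its parameters are made up for illustration, and flushing every 255 vectors is just one choice that keeps the byte counters from wrapping.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>     /* size_t */

/* Hypothetical helper: count bytes equal to `value` in buf[0..len). */
static size_t count_byte_sse2(const unsigned char *buf, size_t len, unsigned char value)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i needle = _mm_set1_epi8((char)value);
    __m128i totals = zero;               /* two 64-bit running totals */
    size_t i = 0;

    while (len - i >= 16) {
        __m128i counters = zero;         /* sixteen 8-bit counters */
        /* At most 255 vectors per inner loop, so no byte counter can wrap. */
        for (int k = 0; k < 255 && len - i >= 16; k++, i += 16) {
            __m128i data = _mm_loadu_si128((const __m128i *)(buf + i));
            /* pcmpeqb gives 0 or -1; psubb of -1 adds 1 to that counter. */
            counters = _mm_sub_epi8(counters, _mm_cmpeq_epi8(data, needle));
        }
        /* psadbw against zero sums each group of 8 bytes; paddq accumulates. */
        totals = _mm_add_epi64(totals, _mm_sad_epu8(counters, zero));
    }

    /* Combine the two 64-bit lanes via a store, which needs only SSE2. */
    unsigned long long lanes[2];
    _mm_storeu_si128((__m128i *)lanes, totals);
    size_t count = (size_t)(lanes[0] + lanes[1]);

    /* Scalar cleanup for the last len % 16 bytes. */
    for (; i < len; i++)
        count += (buf[i] == value);
    return count;
}
```

Calling count_byte_sse2(buf, len, 55) would count the 55s in a buffer of any length. The inner loop is not unrolled here, so this shows the structure, not a tuned implementation.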

See Agner Fog's optimization guides for more details on x86. According to his instruction tables, psadbw xmm / vpsadbw ymm runs at 1 vector per clock cycle on Skylake, with 3 cycle latency. (Only 1 uop of front-end bandwidth.) All the instructions mentioned above are also single-uop, and run on more than one port (so don't necessarily conflict with each other for throughput). Their 128-bit versions only require SSE2.

If you really only have one vector at a time to count, and aren't looping over memory, then probably pcmpeqb / psadbw / pshufd (copy high half to low) / paddd / movd eax, xmm0 gives you 255 * number of matches in an integer register. One extra vector instruction (like subtract from zero, AND with 1, or pabsb (absolute value)) would remove the x255 scale factor.
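In C with SSE2 intrinsics, that single-vector sequence could be sketched roughly as below, using the subtract-from-zero option to drop the x255 factor. The helper name count_byte_in_block is hypothetical, and it assumes exactly 16 readable bytes at the pointer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: count bytes equal to `value` in one 16-byte block. */
static int count_byte_in_block(const unsigned char *block16, unsigned char value)
{
    __m128i data = _mm_loadu_si128((const __m128i *)block16);
    /* pcmpeqb: 0xFF (-1) in every matching byte, 0 elsewhere. */
    __m128i eq = _mm_cmpeq_epi8(data, _mm_set1_epi8((char)value));
    /* Subtract from zero: each match becomes 1, removing the x255 scale. */
    __m128i ones = _mm_sub_epi8(_mm_setzero_si128(), eq);
    /* psadbw: one partial sum per 8-byte half, in the low bits of each 64-bit lane. */
    __m128i sums = _mm_sad_epu8(ones, _mm_setzero_si128());
    /* pshufd: swap the two 64-bit halves so the high sum lines up with the low one. */
    __m128i high = _mm_shuffle_epi32(sums, _MM_SHUFFLE(1, 0, 3, 2));
    /* paddd + movd: add the halves and move the low 32 bits to an integer. */
    return _mm_cvtsi128_si32(_mm_add_epi32(sums, high));
}
```

For the 16 sample bytes in the question and a value of 55, this returns 5.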

IDK how to write that in C# SIMD, but you definitely do not want a dot-product! Unpack and convert to FP would be about 4x slower than the above, just from the fact that a fixed-width vector holds 4x more bytes than floats, and dpps (_mm_dp_ps) is not fast: 4 uops, and one per 1.5 cycle throughput on Skylake. If you do have to horizontal-sum something other than unsigned bytes, see Fastest way to do horizontal float vector sum on x86 (my answer also includes integer).

Or if Vector.Dot uses pmaddubsw / pmaddwd for integer vectors, then that might not be as bad, but doing a multi-step horizontal sum for each vector of compare results is just bad compared to psadbw, or especially to byte accumulators that you only horizontal sum occasionally.

Or maybe C# optimizes away the actual multiply by a constant vector of 1. Anyway, the first part of this answer is the code you want the CPU to be running. Make that happen however you like, using whatever source code gets it to happen.

Answer 3:

Here is a fast SSE2 implementation in C:

#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>     // size_t

size_t memcount_sse2(const void *s, int c, size_t n) {
    const __m128i zero = _mm_setzero_si128(), cv = _mm_set1_epi8((char)c);
    __m128i sum = zero, acr0, acr1, acr2, acr3;
    const char *p = s, *pe, *end = p + n;
    const char *vec_end = p + (n - n % (252*16));
    // Process blocks of 252*16 bytes so the 8-bit accumulators cannot overflow.
    while (p != vec_end) {
        acr0 = acr1 = acr2 = acr3 = zero;
        for (pe = p + 252*16; p != pe; p += 64) {
            // Each compare yields 0 or -1 per byte, so the accumulators hold negated counts.
            acr0 = _mm_add_epi8(acr0, _mm_cmpeq_epi8(cv, _mm_loadu_si128((const __m128i *)p)));
            acr1 = _mm_add_epi8(acr1, _mm_cmpeq_epi8(cv, _mm_loadu_si128((const __m128i *)(p+16))));
            acr2 = _mm_add_epi8(acr2, _mm_cmpeq_epi8(cv, _mm_loadu_si128((const __m128i *)(p+32))));
            acr3 = _mm_add_epi8(acr3, _mm_cmpeq_epi8(cv, _mm_loadu_si128((const __m128i *)(p+48))));
            __builtin_prefetch(p + 1024);  // GCC/Clang builtin
        }
        // Negate to get positive counts, then sum groups of 8 bytes into 64-bit lanes.
        sum = _mm_add_epi64(sum, _mm_sad_epu8(_mm_sub_epi8(zero, acr0), zero));
        sum = _mm_add_epi64(sum, _mm_sad_epu8(_mm_sub_epi8(zero, acr1), zero));
        sum = _mm_add_epi64(sum, _mm_sad_epu8(_mm_sub_epi8(zero, acr2), zero));
        sum = _mm_add_epi64(sum, _mm_sad_epu8(_mm_sub_epi8(zero, acr3), zero));
    }

    // Add the two 64-bit lanes; going through memory needs only SSE2.
    unsigned long long lanes[2];
    _mm_storeu_si128((__m128i *)lanes, sum);
    size_t count = (size_t)(lanes[0] + lanes[1]);

    // Scalar cleanup. Could be optimized. Compare as unsigned char so byte values above 127 match.
    while (p != end) count += (unsigned char)*p++ == (unsigned char)c;
    return count;
}

See also https://gist.github.com/powturbo for an AVX2 implementation.

Source: https://stackoverflow.com/questions/49552656/how-can-i-count-the-occurrence-of-a-byte-in-array-using-simd

Tags

.net

simd

system.numerics