SSE 4 popcount for 16 8-bit values?

南楼画角 提交于 2019-12-04 06:27:53

SSE 4 popcount for 16 8-bit values can be done in parallel this way:

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

//----------------------------------------------------------------------------
//
// parallelPopcnt16bytes - find population count for 8-bit groups in xmm (16 groups)
//                         each byte of xmm result contains a value ranging from 0 to 8
//
static __m128i parallelPopcnt16bytes (__m128i xmm)
   {
    const __m128i mask4 = _mm_set1_epi8 (0x0F);
    const __m128i lookup = _mm_setr_epi8 (0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
   __m128i low, high, count;

   low = _mm_and_si128 (mask4, xmm);
   high = _mm_and_si128 (mask4, _mm_srli_epi16 (xmm, 4));
   count = _mm_add_epi8 (_mm_shuffle_epi8 (lookup, low), _mm_shuffle_epi8 (lookup, high));
   return count;
   }

//----------------------------------------------------------------------------

int main (void)
    {
    int index;
    __m128i testVector = _mm_set_epi8 (1, 2, 4, 8, 16, 32, 64, 128, 0, 1, 3, 7, 15, 31, 63, 127);
    __m128i counts = parallelPopcnt16bytes (testVector);

    printf ("population count for each byte:");
    for (index = 15; index >= 0; index--)
        {
        uint8_t *bytes = (void *) &counts;
        printf (" %d", bytes [index]);
        }
    printf ("\n");
    return 0;
    }

//----------------------------------------------------------------------------

popcnt was introduced simultaneously with the SSE4.2 ISA extension but does not operate on SSE vector registers. You will need a separate instruction for each individual result.

Furthermore it's not defined for 8-bit operands. You will need to pad to 16 bits if you need a count for each individual byte.

You could sum 8 bytes at a time in 64-bit registers, but that doesn't sound like what you're after.

Reference: The SSE4 manual.

SSE2 solution.

I haven't tested this, but you could AND the SSE register with 0x80808080… to get a 16-byte mask of all 1's or all 0's. Repeat for all 8 bits in a byte, and sum the masks. Since all 1's represents -1 in two's complement, negate the 16 bytes, and you have all the results.

The AND and comparison operations should be able to run in parallel. The chain of additions is dependent but it should still run plenty fast, and it fits in 32 instructions. (Only 7 additions needed.)

/* init */
__m128i bit0 = _mm_set1_epi8( 0x80 );
__m128i mask0 = _mm_and_si128( in, bit0 );
__m128i sum = _mm_cmpeq_epi8( mask0, _mm_setzero_si128() );

/* general pattern */
__m128i bit1 = _mm_set1_epi8( 0x40 );
__m128i mask1 = _mm_and_si128( in, bit1 );
mask1 = _mm_cmpeq_epi8( mask1, _mm_setzero_si128() );
sum = _mm_add_epi8( sum, mask1 );

/* next bit */
__m128i bit2 = _mm_set1_epi8( 0x20 );
__m128i mask2 = _mm_and_si128( in, bit2 );
mask2 = _mm_cmpeq_epi8( mask2, _mm_setzero_si128() );
sum = _mm_add_epi8( sum, mask2 );

...

/* finish up */
sum = _mm_sub_epi8( _mm_setzero_si128(), sum );
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!