AVX2 what is the most efficient way to pack left based on a mask?

不知归路 2020-11-22 06:37

If you have an input array and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?

5 Answers
  •  情歌与酒
    2020-11-22 06:56

    If you are targeting AMD Zen, this method may be preferred, because pdep and pext are very slow on Ryzen (18 cycles each).

    I came up with this method, which uses a compressed LUT of 768 (+1 padding) bytes instead of 8 KB. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane and reduced to its lower 3 bits, giving an index in the range 0-7 for each lane.

    Here is the intrinsics version, along with the code to build the LUT.

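    #include <immintrin.h>
    #include <cstdint>
    //Assumed typedefs; the original snippet does not show them
    using u8 = uint8_t;
    using u32 = uint32_t;
    extern u8 g_pack_left_table_u8x3[256 * 3 + 1]; //defined further down

    //Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
    __m256i MoveMaskToIndices(u32 moveMask) {
        u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
        __m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits hold our LUT entry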
    
       // __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
    
        //now shift it right to get 3 bits at bottom
        //__m256i shufmask = _mm256_srli_epi32(m, 29);
    
        //Simplified version suggested by wim
        //shift each lane so the desired 3 bits are at the bottom
        //There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the lowest 3 bits so this is ok
        __m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
        return shufmask;
    }
    
    u32 get_nth_bits(int a) {
        u32 out = 0;
        int c = 0;
        for (int i = 0; i < 8; ++i) {
            auto set = (a >> i) & 1;
            if (set) {
                out |= (i << (c * 3));
                c++;
            }
        }
        return out;
    }
    u8 g_pack_left_table_u8x3[256 * 3 + 1];
    
    void BuildPackMask() {
        for (int i = 0; i < 256; ++i) {
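            //4-byte store into a 3-byte slot: the top byte is always 0 and is overwritten by the next entry (or lands in the padding byte)
            *reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);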
        }
    }
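
    To check the table layout without AVX2 hardware, the lookup can be modeled in plain scalar C++. This is only a sketch under stated assumptions: the names pack_table, packed_indices, build_table, mask_to_indices and pack_left8 are hypothetical, and the loop over lanes stands in for the broadcast, the per-lane variable shift and the permute of the intrinsics version. It is not the vectorized code path itself.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the compressed LUT; all names are illustrative.
static uint8_t pack_table[256 * 3 + 1];  // 3 bytes per mask, +1 byte of padding

// Pack the positions of the set bits of `mask` into consecutive 3-bit fields.
static uint32_t packed_indices(uint32_t mask) {
    uint32_t out = 0;
    int count = 0;
    for (uint32_t i = 0; i < 8; ++i) {
        if ((mask >> i) & 1u) {
            out |= i << (count * 3);
            ++count;
        }
    }
    return out;
}

// Store the low 24 bits of every entry byte by byte (little-endian order).
static void build_table() {
    for (uint32_t m = 0; m < 256; ++m) {
        uint32_t v = packed_indices(m);
        pack_table[m * 3 + 0] = static_cast<uint8_t>(v & 0xFF);
        pack_table[m * 3 + 1] = static_cast<uint8_t>((v >> 8) & 0xFF);
        pack_table[m * 3 + 2] = static_cast<uint8_t>((v >> 16) & 0xFF);
    }
}

// Scalar stand-in for: broadcast the entry, shift lane k right by 3*k,
// and look only at the low 3 bits of each lane.
static void mask_to_indices(uint32_t mask, uint32_t idx[8]) {
    const uint8_t *p = pack_table + mask * 3;
    uint32_t word = p[0] | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16);
    for (int lane = 0; lane < 8; ++lane) {
        idx[lane] = (word >> (3 * lane)) & 7u;
    }
}

// Pack the selected elements of an 8-float block to the left of dst.
// Returns how many elements were kept; only that many are written.
static size_t pack_left8(const float *src, uint32_t mask, float *dst) {
    uint32_t idx[8];
    mask_to_indices(mask, idx);
    size_t kept = 0;
    for (uint32_t m = mask; m != 0; m &= m - 1) {
        ++kept;  // count set bits
    }
    for (size_t lane = 0; lane < kept; ++lane) {
        dst[lane] = src[idx[lane]];
    }
    return kept;
}
```

    After calling build_table() once, pack_left8(src, mask, dst) copies the selected elements to the front of dst and returns their count, which a caller would use to advance the output pointer. The vector version would typically store all 8 lanes and advance by the same count, so lanes beyond the count hold leftover values.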
    

    Here is the assembly generated by MSVC:

      lea ecx, DWORD PTR [rcx+rcx*2]
      lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
      vpbroadcastd ymm0, DWORD PTR [rcx+rax]
      vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm@00000015000000120000000f0000000c00000009000000060000000300000000
      
    
