How to concatenate two vector efficiently using AVX2? (a lane-crossing version of VPALIGNR)

前端 未结 1 1801
醉梦人生
醉梦人生 2021-01-05 03:46

I have implemented an inline function (_mm256_concat_epi16). It concatenates two AVX2 vector containing 16-bit values. It works fine for first 8 numbers. If I w

相关标签:
1条回答
  • 2021-01-05 04:20

    It's impossible to give a general answer to this question. It's such a short fragment that the best strategy depends on the surrounding code and what CPU you're running on.

    Sometimes we can rule out things that have no advantages on any CPU and just consume more of the same resources, but that's not the case when considering a tradeoff between unaligned loads vs. shuffles.


    In a loop over a possibly-misaligned input array, you're probably best off using unaligned loads. Especially your input array will be aligned at runtime most of the time. If not, and it's a problem, then if possible do an unaligned first vector and then aligned from the first alignment boundary. I.e. the usual tricks for a prologue that gets to an alignment boundary for the main loop. But with multiple pointers, it's usually best to align your store pointer, and do unaligned loads (according to Intel's optimization manual), if your pointers are misaligned relative to each other. (See Agner Fog's optimization guides and other links in the x86 tag wiki.)

    On recent Intel CPUs, vector loads that cross a cache-line boundary still have pretty good throughput, but this is one reason why you might consider an ALU strategy, or a mix of shuffles and overlapping loads (in an unrolled loop you might alternate strategies so you don't bottleneck on either one).


    As Stephen Canon points out in _mm_alignr_epi8 (PALIGNR) equivalent in AVX2 (a possible duplicate of this), if you need several different offset windows into the same concatenation of two vectors, then two stores + repeated unaligned loads is excellent. On Intel CPUs, you get 2-per-clock throughput for 256b unaligned loads as long as they don't cross a cache-line boundary (so alignas(64) your buffer).

    Store/reload is not great for the single-use case, though, because of store-forwarding failure for a load that isn't fully contained within either store. It's still cheap for throughput, but expensive for latency. Another huge advantage is that it's efficient with a runtime-variable offset.

    If latency is an issue, using ALU shuffles can be good (especially on Intel where lane-crossing shuffles aren't a lot more expensive than in-lane). Again, think about / measure what your loop bottlenecks on, or just try store/reload vs. ALU.


    The shuffle strategy:

    Your current function can only compile if indx is known at compile time (because palignr needs the byte-shift-count as an immediate).

    As @Mohammad suggested, you could pick from different shuffles at compile time, depending on the indx value. He seemed to be suggesting a CPP macro, but that would be ugly.

    Much easier to simply use if(indx>=16) or something like that, which will optimize away. (You could make indx a template parameter if a compiler refused to compile your code with an apparently "variable" shift count.) Agner Fog uses this in his Vector Class Library (license=GPL), for functions like template <uint32_t d> static inline Vec8ui divide_by_ui(Vec8ui const & x).

    Related: Emulating shifts on 32 bytes with AVX has an answer with different shuffle strategies depending on shift count. But it's only trying to emulate a shift, not a concat / lane-crossing palignr.

    vperm2i128 is fast on Intel mainstream CPUs (but still a lane-crossing shuffle so 3c latency), but slow on Ryzen (8 uops with 3c latency/3c throughput). If you were tuning for Ryzen, you'd want to use an if() to figure out a combination of vextracti128 to get a high lane and/or vinserti128 on a low lane. You might also want to use separate shifts and then vpblendd the results together.


    Designing the right shuffles:

    The indx determines where the new bytes for each lane need to come from. Let's simplify by considering 64-bit elements:

     hi |  lo
    D C | B A    # a
    H G | F E    # b
    
    palignr(b,a i) forms (H G D C) >> i | (F E B A) >> i
    But what we want is
    
    D C | B A    # concatq(b,a,0): no-op.  return a;
    
    E D | C B    # concatq(b,a,1):  applies to 16-bit element counts from 1..7
              low lane needs  hi(a).lo(a)
              high lane needs lo(b).hi(a)
            return palignr(swapmerge(a,b), a, 2*i).  (Where we use vperm2i128 to lane-swap+merge hi(a) and lo(b))
    F E | D C    # concatq(b,a,2)
            special case of exactly half reg width: Just use vperm2i128.
            Or on Ryzen, `vextracti128` + `vinserti128`
    G F | E D    # concatq(b,a,3): applies to 16-bit element counts from 9..15
            low  lane needs lo(b).hi(a)
            high lane needs hi(b).lo(b).  vperm2i128 -> palignr looks good
            return palignr(b, swapmerge(a,b), 2*i-16).
    
    H G | F E    # concatq(b,a,4): no op: return b;
    

    Interestingly, lo(b) | hi(a) is used in both palignr cases. We never need lo(a) | hi(b) as a palignr input.

    These design notes lead directly to this implementation:

    // UNTESTED
    // clang refuses to compile this, but gcc works.
    
    // in many cases won't be faster than simply using unaligned loads.
    static inline __m256i lanecrossing_alignr_epi16(__m256i a, __m256i  b, unsigned int count) {
    #endif
       if (count == 0)
         return a;
       else if (count <= 7)
         return _mm256_alignr_epi8(_mm256_permute2x128_si256(a,b,0x21),a,count*2);
       else if (count == 8)
          return _mm256_permute2x128_si256(a,b,0x21);
       else if (count > 8 && count <= 15)
         // clang chokes on the negative shift count even when this branch is not taken
         return _mm256_alignr_epi8(b,_mm256_permute2x128_si256(a,b,0x21),count*2 - 16);
       else if (count == 16)
         return b;
       else
         assert(0 && "out-of-bounds shift count");
    
    // can't get this to work without C++ constexpr :/
    //   else
    //     static_assert(count <= 16, "out-of-bounds shift count");
    }
    

    I put it on the Godbolt compiler explorer with some test functions that inline it with different constant shift counts. gcc6.3 compiles it to

    test_alignr0:
        ret            # a was already in ymm0
    test_alignr3:
        vperm2i128      ymm1, ymm0, ymm1, 33   # replaces b
        vpalignr        ymm0, ymm1, ymm0, 6
        ret
    test_alignr8:
        vperm2i128      ymm0, ymm0, ymm1, 33
        ret
    test_alignr11:
        vperm2i128      ymm0, ymm0, ymm1, 33   # replaces a
        vpalignr        ymm0, ymm1, ymm0, 6
        ret
    test_alignr16:
        vmovdqa ymm0, ymm1
        ret
    

    clang chokes on it. First, it says error: argument should be a value from 0 to 255 for the count*2 - 16 for counts that don't use that branch of the if/else chain.

    Also, it can't wait and see that the alignr() count ends up being a compile-time constant: error: argument to '__builtin_ia32_palignr256' must be a constant integer, even when it is after inlining. You can solve that in C++ by making count a template parameter:

    template<unsigned int count>
    static inline __m256i lanecrossing_alignr_epi16(__m256i a, __m256i  b) {
       static_assert(count<=16, "out-of-bounds shift count");
       ...
    

    In C, you could make it a CPP macro instead of a function to deal with that.

    The count*2 - 16 problem is harder to solve for clang. You could make the shift count part of the macro name, like CONCAT256_EPI16_7. There's probably some CPP trickery you could use to do the 1..7 versions and the 9..15 versions separately. (Boost has some crazy CPP hacks.)


    BTW, your print function is weird. It calls the first element c[1] instead of c[0]. Vector indices start at 0 for shuffles, so it's really confusing.

    0 讨论(0)
提交回复
热议问题