I have implemented an inline function (_mm256_concat_epi16). It concatenates two AVX2 vectors containing 16-bit values. It works fine for the first 8 numbers. If I w
It's impossible to give a general answer to this question. It's such a short fragment that the best strategy depends on the surrounding code and what CPU you're running on.
Sometimes we can rule out things that have no advantages on any CPU and just consume more of the same resources, but that's not the case when considering a tradeoff between unaligned loads vs. shuffles.
In a loop over a possibly-misaligned input array, you're probably best off using unaligned loads, especially if your input array will be aligned at runtime most of the time. If it isn't, and that's a problem, then if possible do an unaligned first vector and then aligned vectors from the first alignment boundary, i.e. the usual tricks for a prologue that gets to an alignment boundary for the main loop. But with multiple pointers that are misaligned relative to each other, it's usually best to align your store pointer and do unaligned loads (according to Intel's optimization manual). (See Agner Fog's optimization guides and other links in the x86 tag wiki.)
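A minimal sketch of that prologue idea, in portable C++ rather than intrinsics. The names elems_until_aligned and add_one are hypothetical, the "+1" loop body is only a stand-in for real work, and the prologue here is scalar rather than one unaligned vector. It assumes a 32-byte vector width and a pointer that is at least 2-byte aligned.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many 16-bit elements to process before `p`
// reaches the next `align`-byte boundary (0 if it is already aligned).
// Assumes `align` is a power of two and `p` is 2-byte aligned.
inline std::size_t elems_until_aligned(const std::uint16_t* p, std::size_t align) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t misalign = addr & (align - 1);             // bytes past the previous boundary
    const std::size_t bytes = (align - misalign) & (align - 1);  // bytes up to the next boundary
    return bytes / sizeof(std::uint16_t);
}

// Hypothetical loop skeleton: align the *store* pointer, leave loads unaligned.
inline void add_one(std::uint16_t* dst, const std::uint16_t* src, std::size_t n) {
    std::size_t i = 0;
    const std::size_t head = std::min(n, elems_until_aligned(dst, 32));
    for (; i < head; ++i)                       // prologue: up to the first 32-byte boundary of dst
        dst[i] = static_cast<std::uint16_t>(src[i] + 1);
    for (; i + 16 <= n; i += 16)                // main loop: dst + i is 32-byte aligned here
        for (std::size_t j = 0; j < 16; ++j)    // stands in for one unaligned load + one aligned store
            dst[i + j] = static_cast<std::uint16_t>(src[i + j] + 1);
    for (; i < n; ++i)                          // tail: leftover elements
        dst[i] = static_cast<std::uint16_t>(src[i] + 1);
}
```

In a real AVX2 loop the inner 16-element loop would become an unaligned load from src + i and an aligned store to dst + i.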
On recent Intel CPUs, vector loads that cross a cache-line boundary still have pretty good throughput, but this is one reason why you might consider an ALU strategy, or a mix of shuffles and overlapping loads (in an unrolled loop you might alternate strategies so you don't bottleneck on either one).
As Stephen Canon points out in "_mm_alignr_epi8 (PALIGNR) equivalent in AVX2" (a possible duplicate of this), if you need several different offset windows into the same concatenation of two vectors, then two stores + repeated unaligned loads is excellent. On Intel CPUs, you get 2-per-clock throughput for 256b unaligned loads as long as they don't cross a cache-line boundary (so alignas(64) your buffer).
Store/reload is not great for the single-use case, though, because of store-forwarding failure for a load that isn't fully contained within either store. It's still cheap for throughput, but expensive for latency. On the other hand, a huge advantage of store/reload is that it's efficient with a runtime-variable offset.
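A minimal sketch of the store/reload idea with a runtime-variable offset. To keep it portable it uses std::memcpy where real code would use _mm256_store_si256 and _mm256_loadu_si256. The name concat_store_reload is hypothetical. I'd expect compilers to turn these fixed-size copies into vector moves when AVX2 is enabled, but check the asm rather than take that on faith.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: out = elements [offset, offset + 16) of the 32-element
// concatenation {a, b}, with a in the low half. offset is in 16-bit elements, 0..16.
inline void concat_store_reload(const std::uint16_t a[16], const std::uint16_t b[16],
                                unsigned offset, std::uint16_t out[16]) {
    assert(offset <= 16);
    alignas(64) std::uint16_t buf[32];   // exactly one 64-byte cache line, so no reload splits a line
    std::memcpy(buf, a, 32);             // store the low vector
    std::memcpy(buf + 16, b, 32);        // store the high vector
    std::memcpy(out, buf + offset, 32);  // unaligned reload; offset need not be a compile-time constant
}
```

If you need several windows into the same pair of vectors, do the two stores once and repeat only the reload.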
If latency is an issue, using ALU shuffles can be good (especially on Intel where lane-crossing shuffles aren't a lot more expensive than in-lane). Again, think about / measure what your loop bottlenecks on, or just try store/reload vs. ALU.
The shuffle strategy:
Your current function can only compile if indx is known at compile time (because palignr needs the byte-shift-count as an immediate).
As @Mohammad suggested, you could pick from different shuffles at compile time, depending on the indx value. He seemed to be suggesting a CPP macro, but that would be ugly.
Much easier to simply use if(indx>=16) or something like that, which will optimize away. (You could make indx a template parameter if a compiler refused to compile your code with an apparently "variable" shift count.) Agner Fog uses this in his Vector Class Library (license=GPL), for functions like template <uint32_t d> static inline Vec8ui divide_by_ui(Vec8ui const & x).
Related: Emulating shifts on 32 bytes with AVX has an answer with different shuffle strategies depending on shift count. But it's only trying to emulate a shift, not a concat / lane-crossing palignr.
vperm2i128 is fast on Intel mainstream CPUs (but still a lane-crossing shuffle so 3c latency), but slow on Ryzen (8 uops with 3c latency/3c throughput). If you were tuning for Ryzen, you'd want to use an if() to figure out a combination of vextracti128 to get a high lane and/or vinserti128 on a low lane. You might also want to use separate shifts and then vpblendd the results together.
Designing the right shuffles:
The indx determines where the new bytes for each lane need to come from. Let's simplify by considering 64-bit elements:
hi | lo
D C | B A # a
H G | F E # b
palignr(b,a,i) forms (H G D C) >> i | (F E B A) >> i
But what we want is
D C | B A # concatq(b,a,0): no-op. return a;
E D | C B # concatq(b,a,1): applies to 16-bit element counts from 1..7
low lane needs hi(a).lo(a)
high lane needs lo(b).hi(a)
return palignr(swapmerge(a,b), a, 2*i). (Where we use vperm2i128 to lane-swap+merge hi(a) and lo(b))
F E | D C # concatq(b,a,2)
special case of exactly half reg width: Just use vperm2i128.
Or on Ryzen, `vextracti128` + `vinserti128`
G F | E D # concatq(b,a,3): applies to 16-bit element counts from 9..15
low lane needs lo(b).hi(a)
high lane needs hi(b).lo(b). vperm2i128 -> palignr looks good
return palignr(b, swapmerge(a,b), 2*i-16).
H G | F E # concatq(b,a,4): no op: return b;
Interestingly, lo(b) | hi(a) is used in both palignr cases. We never need lo(a) | hi(b) as a palignr input.
These design notes lead directly to this implementation:
#include <immintrin.h>
#include <assert.h>

// UNTESTED
// clang refuses to compile this, but gcc works.
// in many cases won't be faster than simply using unaligned loads.
static inline __m256i lanecrossing_alignr_epi16(__m256i a, __m256i b, unsigned int count) {
    if (count == 0)
        return a;
    else if (count <= 7)
        return _mm256_alignr_epi8(_mm256_permute2x128_si256(a,b,0x21),a,count*2);
    else if (count == 8)
        return _mm256_permute2x128_si256(a,b,0x21);
    else if (count > 8 && count <= 15)
        // clang chokes on the negative shift count even when this branch is not taken
        return _mm256_alignr_epi8(b,_mm256_permute2x128_si256(a,b,0x21),count*2 - 16);
    else if (count == 16)
        return b;
    else
        assert(0 && "out-of-bounds shift count");

    // can't get this to work without C++ constexpr :/
    // else
    // static_assert(count <= 16, "out-of-bounds shift count");
}
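Since the function above is marked untested, one way to check the case analysis without any AVX2 hardware is a scalar model of the two instructions. This is only a sketch: Bytes32, model_permute_0x21, model_alignr and model_concat are hypothetical names, and the models cover only what is needed here (vperm2i128 with immediate 0x21, and vpalignr with a byte count below 16).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;  // one 256-bit register, byte 0 = lowest

// Model of vperm2i128 with imm8 = 0x21:
// result low lane = high lane of a, result high lane = low lane of b.
inline Bytes32 model_permute_0x21(const Bytes32& a, const Bytes32& b) {
    Bytes32 r{};
    for (std::size_t i = 0; i < 16; ++i) {
        r[i] = a[16 + i];
        r[16 + i] = b[i];
    }
    return r;
}

// Model of 256-bit vpalignr for imm < 16. Independently in each 128-bit lane:
// take 16 bytes starting at byte `imm` of the 32-byte value (hi lane : lo lane).
inline Bytes32 model_alignr(const Bytes32& hi, const Bytes32& lo, std::size_t imm) {
    Bytes32 r{};
    for (std::size_t lane = 0; lane < 32; lane += 16) {
        for (std::size_t i = 0; i < 16; ++i) {
            const std::size_t src = i + imm;
            r[lane + i] = (src < 16) ? lo[lane + src] : hi[lane + src - 16];
        }
    }
    return r;
}

// Same case analysis as lanecrossing_alignr_epi16; count is in 16-bit elements, 0..16.
inline Bytes32 model_concat(const Bytes32& a, const Bytes32& b, unsigned count) {
    assert(count <= 16);
    if (count == 0) return a;
    if (count == 16) return b;
    const Bytes32 mid = model_permute_0x21(a, b);       // hi(a) in the low lane, lo(b) in the high lane
    if (count < 8) return model_alignr(mid, a, count * 2);
    if (count == 8) return mid;
    return model_alignr(b, mid, count * 2 - 16);
}
```

Filling a with bytes 0..31 and b with bytes 32..63 makes the expected result easy to state: byte i of the output should equal 2*count + i.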
I put it on the Godbolt compiler explorer with some test functions that inline it with different constant shift counts. gcc6.3 compiles it to
test_alignr0:
ret # a was already in ymm0
test_alignr3:
vperm2i128 ymm1, ymm0, ymm1, 33 # replaces b
vpalignr ymm0, ymm1, ymm0, 6
ret
test_alignr8:
vperm2i128 ymm0, ymm0, ymm1, 33
ret
test_alignr11:
vperm2i128 ymm0, ymm0, ymm1, 33 # replaces a
vpalignr ymm0, ymm1, ymm0, 6
ret
test_alignr16:
vmovdqa ymm0, ymm1
ret
clang chokes on it. First, it says "error: argument should be a value from 0 to 255" for the count*2 - 16 expression, even for counts that don't use that branch of the if/else chain.
Also, it can't wait and see that the alignr() count ends up being a compile-time constant: "error: argument to '__builtin_ia32_palignr256' must be a constant integer", even when it is a constant after inlining. You can solve that in C++ by making count a template parameter:
template<unsigned int count>
static inline __m256i lanecrossing_alignr_epi16(__m256i a, __m256i b) {
static_assert(count<=16, "out-of-bounds shift count");
...
In C, you could make it a CPP macro instead of a function to deal with that.
The count*2 - 16 problem is harder to solve for clang. You could make the shift count part of the macro name, like CONCAT256_EPI16_7. There's probably some CPP trickery you could use to do the 1..7 versions and the 9..15 versions separately. (Boost has some crazy CPP hacks.)
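If C++17 is available, a different option from the macro approach is to combine the template parameter with if constexpr. As far as I know, a discarded if constexpr branch inside a template is never instantiated, so clang should not range-check a count*2 - 16 immediate on a path that isn't taken. I have not tried this on every clang version, so treat it as a sketch. AVX2_TARGET and the concat_epi16 array wrapper are hypothetical names, and the target attribute assumes GCC or clang.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

// Assumption: GCC/clang. Enables AVX2 for these functions only, so the file
// builds without -mavx2. With -mavx2 on the command line it can be left empty.
#define AVX2_TARGET __attribute__((target("avx2")))

// count is in 16-bit elements: 0 returns a, 16 returns b.
template <unsigned count>
AVX2_TARGET static inline __m256i lanecrossing_alignr_epi16(__m256i a, __m256i b) {
    static_assert(count <= 16, "out-of-bounds shift count");
    if constexpr (count == 0) {
        return a;
    } else if constexpr (count == 16) {
        return b;
    } else {
        // hi(a) in the low lane, lo(b) in the high lane
        const __m256i mid = _mm256_permute2x128_si256(a, b, 0x21);
        if constexpr (count < 8) {
            return _mm256_alignr_epi8(mid, a, count * 2);
        } else if constexpr (count == 8) {
            return mid;
        } else {
            // Only instantiated for count 9..15, so the immediate is 2..14.
            return _mm256_alignr_epi8(b, mid, count * 2 - 16);
        }
    }
}

// Hypothetical array wrapper, so callers built without AVX2 never pass __m256i by value.
template <unsigned count>
AVX2_TARGET void concat_epi16(const std::uint16_t* a, const std::uint16_t* b, std::uint16_t* out) {
    const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), lanecrossing_alignr_epi16<count>(va, vb));
}
#endif
```

Usage looks like concat_epi16<3>(a, b, out), where a, b and out each point to 16 uint16_t values. Check for AVX2 support at runtime first if the binary might run on older CPUs.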
BTW, your print function is weird. It calls the first element c[1] instead of c[0]. Vector indices start at 0 for shuffles, so it's really confusing.