In designing forward looking algorithms for AVX256, AVX512 and one day AVX1024 and considering the potential implementation complexity/cost of fully generic permutes for lar
Generally yes, in-lane is still lower latency on SKX (1 cycle vs. 3), but usually it's not worth spending extra instructions to use them instead of the powerful lane-crossing shuffles. However, vpermt2w
and a couple other shuffles need multiple shuffle-port uops, so they cost as much as multiple simpler shuffles.
Shuffle throughput very easily becomes a bottleneck if you aren't careful on recent Intel CPUs (only one shuffle execution unit on port 5). Sometimes it's even worth using two overlapping loads instead of loading once and shuffling, i.e. using an unaligned load as a shuffle, because L1D cache is fast, and so is load-port handling of unaligned loads. (Less so with AVX512, though, especially because every unaligned 512b load is automatically a cache-line split, because vectors and cache lines are both 64 bytes.)
There are also rotate (new in AVX512) and shift instructions (not new). The 64-bit element size versions can move data between smaller elements if you use a shift or rotate count of 32 or 16, for example. vprolq zmm, zmm, 32
is 1c latency and runs on port 0 (and also port1 for the xmm/ymm versions), swapping every element with it's neighbour. Shifts/rotates don't compete for port 5 on SKX.
For a horizontal sum, the only real choice is what order to shuffle in. Usually start with extract
/ add
down to 128b, then use __m128
shuffles (or integer shifts), instead of using vpermd/q
for every shuffle. Or if you want the result broadcast to all elements, use in-lane shuffles between the first few adds, and then shuffle in 128b then 256b chunks with lane-crossing shuffles. (Shuffling in 128b chunks isn't faster than smaller granularity immediate-control shuffles like vpermq z,z,imm8
on SKX, but that's all you need for an hsum after doing the in-lane stuff with vshufps
or vpermilps
.)
On KNL, it's not lanes, it's 1-source vs. 2-source shuffles that matter.
e.g. vshufps dst, same,same, imm8
is half the throughput of vpermilps dst, src, imm8
.
1-source shuffles with a vector control like vpermd v,v,v
are still fast, though (1 source + 1 shuffle-control vector).
Even when they're only 1 uop, the 4-7c latency shuffles (2-input) have worse than 2c throughput. I guess that means KNL's shuffle unit isn't fully pipelined.
Note that some future AMD CPUs will probably split 512b ops into two 256b ops (or possibly four 128b ops). Lane-crossing shuffles are significantly more expensive there. Even vperm2f128
on Ryzen is 8 uops, 3c lat / 3c throughput, vs. 1 uop on SKL. In-lane shuffles obviously decompose into 1 uop per lane fairly easily, but lane-crossing doesn't.
Raw data
For Skylake-AVX512 (aka Skylake-X, aka SKL-SP or SKX), there are InstLatx64 (Instruction throughput/Latency) results, and IACA supports it. InstLatx64 has a spreadsheet (ODS OpenOffice/LibreOffice format) combining data from IACA (just uop count and ports), and published by Intel in a PDF (throughput/latency), and from real experimental testing on real hardware (throughput/latency). Check the main InstLat page for updates.
Agner Fog's instruction tables have data for Knight's Landing Xeon Phi (KNL), and there's a section about it's Silvermont-based microarchitecture in his microarch PDF. He unfortunately doesn't have SKX test results yet.
KNL instructions have better latency if their input is coming from the same execution unit (e.g. shuffle -> shuffle) vs. FMA -> shuffle. (See the note at the top of Agner's spreadsheet). This is what the 4-7c latency numbers are about. A transpose or something doing a chain of shuffles might see mostly the lower latency number. (But KNL has generally high latencies, which is why it has 4-way hyperthreading to try to hide them).
All lane crossing shuffles are at best 1 uop, 3c latency, 1c throughput. But even complex/powerful ones like 2-input vpermt2ps
are that fast. This includes all shuffles that shuffle whole lanes, or insert/extract 256b chunks.
All in-lane-only shuffles are 1c latency (except for the xmm version of some new-in-avx512 lane-crossing shuffles). So use vpshufd zmm, zmm, imm8
or vpunpcklqdq zmm, zmm, zmm
when that's all you need. Or vpshufb
or vpermilps
with a vector control input.
Like Haswell and SKL (non-avx512), SKX can only run shuffle uops on port 5. Again like those earlier CPUs, it can broadcast-load using only the load ports, so that's just as cheap as a regular vector load. AVX512 broadcast loads can micro-fuse, making memory-source broadcasts cheaper (in shuffle throughput terms) than register source.
Even vmovsldup ymm, [mem]
/ vmovshdup ymm, [mem]
use just a load uop for the 256b shuffle. IDK about 512b; Instlat didn't test memory-source movsl/hdup, so we only have Agner Fog's data. (And IIRC I confirmed that on my own SKL).
Note that when running 512b instructions, the vector ALUs on port 1 are disabled, so you have a max throughput of 2 vector ALU uops per clock. (But p1 can still run integer stuff.) And vector load/store uops don't need p0 / p5, so you can still bottleneck on the front-end (4 uops per clock issue/rename) in code with a mix of non-fused loads, stores, and ALU (and integer loop overhead, and vmovdqa register copying handled in the rename stage with unfused-domain uop).
Exceptions to the rule on SKX:
VPMOVWB ymm, zmm
and similar truncate or signed/unsigned saturate instructions are 2 uops, 4c latency. (Or 2c for the xmm versions). vpmovqd
is 1 uop, 3c (or 1c xmm) latency, because its smallest granularity is dword and it's only truncating, not saturating, so it can be implemented internally with the same hardware that's needed for pshufb
for example. vpmovz/sx
instructions are still only 1 uop.
vpcompressd/q
(left-pack based on a mask) is 2 uops (p5), 3c latency. (Or 6c according to what Intel publishes; maybe Instlat is testing the vector->vector latency and Intel is giving the k
register -> vector latency? Unlikely that it's data-dependent and faster with a trivial mask.) vpexpandd
is also 2 uops.
AVX512BW vpermt2w
/ vpermi2w
is 3 uops (p0 + 2p5), 7c latency for all 3 operand sizes (xmm/ymm/zmm). Small-granularity wide shuffles are expensive in hardware (See Where is VPERMB in AVX2? including the comments). This is a 2-source 16-bit-element shuffle with the control in a 3rd vector. It might get faster eventually in future generations, the way pshufb
(and all full-register shuffles with granularity smaller than 8 bytes) was slow in first-gen Core2 Conroe/Merom, but got fast in the die-shrink next generation (Penryn).
AVX512BW vpermw
(one-source lane-crossing word shuffle) is 2p5, 6c latency, 2c throughput because it's a lane-crossing word shuffle.
expect AVX512VBMI vpermt2b
to be as bad or worse on Cannonlake, even if Cannonlake does improve vpermt2w
/ vpermw
.
vpermt2d/q/ps/pd
are all efficient in SKX because their granularity is dword (32-bit) or wider. (But still apparently 3c latency for the xmm version, so they didn't build separate hardware to speed up the one-lane version). These are even more powerful than a lane-crossing shufps
: a variable control and with no limitation on which source register each element comes from. It's a fully general 2-source shuffle where you index into the concatenation of 2 registers, overwriting the index (vpermi2*
) or one of the tables (vpermt2*
). There's only one intrinsic because the compiler handles register allocation and copying to preserve still-needed values.
Shuffles run on the FP0 port only, but front-end throughput is only 2 uops per clock. So more of your total instructions can be shuffles without bottlenecking on that (vs. SKX), unless they're half-throughput shuffles.
In general, 2-input shuffles like vperm2f128
/vshuff32x4
or vshufps
are 2c throughput / 4-7c latency, while 1-input shuffles like vpermd
are 1c throughput / 3-6c latency. (i.e. 2 inputs occupies the shuffle unit for an extra cycle (half throughput) and costs 1 extra cycle of latency). Agner isn't clear on exactly what the effect of the not-fully-pipelined shuffles is, but I assume it just ties up the shuffle unit, and not everything on port FP0 (like the FMA unit).
Lane-crossing or not makes no difference on KNL, e.g. vpermilps
and vpermps
are both fast (1c throughput, 3-6c latency), but vpermi2ps
and vshufps
are both slow (2c throughput, 4-7c latency). I don't see any exceptions to that for instructions where KNL supports an AVX512 version. (i.e. not counting AVX2 vpshufb
, i.e. pretty much anything with 32-bit or larger granularity).
vinserti32x4
and so on (insert/extract with granularity of at least 128b) is a 2-input shuffle for insert, but is fast: 3-6c lat / 1c tput. But extract-to-memory is multiple uops and causes a decode bottleneck: e.g. VEXTRACTF32X4 m128,z
is 4 uops, one per 8c throughput. (mostly because of decode).
vcompress/ps/d
, vpcompressd/q
and v[p]expandd/q/ps/pd
are 1 uop, 3-6c latency. (vs. 2 uops on SKX). But throughput is only one per 3c: Agner doesn't indicate whether this ties up the whole shuffle unit for 2c, or if only this part is not fully pipelined.
AVX2 byte/word shuffles are very slow for 256b operand-size: pshufb xmm
is 5 uops / 10c throughput, vpshufb ymm
is 12 uops / 12c throughput. (MMX pshufb mm
is 1 uop, 2-6c latency, 1c throughput, so I guess the byte-granularity shuffle unit is 64b wide.)
pshuflw xmm
is 1 uop fast, but vpshuflw ymm
is 4 uops, 8c throughput.
Video encoding on KNL might be barely worth it with 128-bit AVX (vpsadbw xmm
is fast), but AVX2 ymm instructions are generally slower than using more 1 uop xmm instructions.
movss/sd xmm,xmm
is a blend, not a shuffle, and has 0.5c throughput / 2c latency.
vpunpcklbw / wd are super slow (except the xmm version), but DQ and QDQ are the regular speed even for ymm / zmm operand size. (2c throughput / 4-7c latency, because it's a 2-input shuffle).
vpmovzx
is 3c latency (not 3-6c?) and 2c throughput even for vpmovzxbw
. vpmovsx
is slower: 2 uops and thus a decode bottleneck, making it 8c latency and 7c throughput. The narrowing truncate instructions (vpmovqb
and so on) are 1 uop, 3c lat / 1c tput, but the narrowing saturate instructions are 2 uops and thus slow. Agner didn't test them with a memory destination.