AVX2 has lots of good stuff. For example, it has plenty of instructions which are pretty much strictly more powerful than their precursors. Take VPERMD: it allows you to tot
I'm 99% sure the main factor is transistor cost of implementation. It would clearly be very useful, and the only reason it doesn't exist is that the implementation cost must outweigh the significant benefit.
Coding space issues are unlikely; the VEX coding space provides a LOT of room. Like, really a lot, since the field that represents combinations of prefixes isn't a bit-field, it's an integer with most of the values unused.
They decided to implement it for AVX512VBMI, though, with larger element sizes available in AVX512BW and AVX512F. Maybe they realized how much it sucked to not have this, and decided to do it anyway. AVX512F takes a lot of die area / transistors to implement, so much that Intel decided not to even implement it in retail desktop CPUs for a couple generations.
(Part of that is that I think these days a lot of code that can take advantage of brand-new instruction sets is written to run on known servers, instead of doing runtime dispatching for use on client machines.)
According to Wikipedia, AVX512VBMI isn't coming until Cannonlake, but then we will have vpermi2b, which does 64 parallel table lookups from a 128B table (2 zmm vectors). Skylake Xeon will only bring vpermi2w and larger element sizes (AVX512F + AVX512BW).
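To make "64 parallel table lookups from a 128B table" concrete, here is a rough scalar model in C of what I understand vpermi2b to do per byte: the low 6 bits of each index pick a byte, bit 6 picks which of the two 64-byte tables, and bit 7 is ignored. The function name `vpermi2b_model` and the `table_lo` / `table_hi` parameter names are made up for illustration; this is a sketch of the semantics, not how you'd write real code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a vpermi2b-style lookup (hypothetical helper name).
 * idx:      64 index bytes, one per output lane
 * table_lo: first 64-byte table  (selected when index bit 6 is 0)
 * table_hi: second 64-byte table (selected when index bit 6 is 1)
 * dst may alias idx, like the real instruction overwriting its index register. */
static void vpermi2b_model(uint8_t dst[64], const uint8_t idx[64],
                           const uint8_t table_lo[64], const uint8_t table_hi[64])
{
    assert(dst && idx && table_lo && table_hi);
    uint8_t out[64];
    for (int i = 0; i < 64; i++) {
        unsigned sel = idx[i] & 0x7F;      /* bit 7 of the index is ignored */
        unsigned pos = sel & 0x3F;         /* low 6 bits: byte within a table */
        out[i] = (sel & 0x40) ? table_hi[pos] : table_lo[pos];
    }
    memcpy(dst, out, sizeof out);          /* copy last so dst can alias idx */
}
```

Every one of the 64 output bytes can come from any of the 128 table bytes, which is exactly the kind of full byte-granularity crossbar that AVX2 doesn't give you.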
I'm pretty sure that thirty-two 32:1 muxers are a lot more expensive than eight 8:1 muxers, even if the 8:1 muxers are 4x wider. They could implement it with multiple stages of shuffling (rather than a single 32:1 stage), since lane-crossing shuffles get a 3-cycle time budget to get their work done. But that's still a lot of transistors.
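For a very crude back-of-the-envelope version of that comparison, assume (and this is purely my assumption, not a real hardware estimate) that an N:1 mux is built as a tree of N-1 two-input muxes per bit of width, and ignore wiring entirely, even though wiring is probably a big part of the real cost. The helper `mux2_bits` is a hypothetical name for this toy model.

```c
#include <assert.h>

/* Toy cost model (an assumption, not a layout estimate):
 * an N:1 mux is a tree of N-1 two-input muxes for every bit of data width.
 * Returns the number of 1-bit 2:1 mux cells for `outputs` such muxes. */
static unsigned long mux2_bits(unsigned outputs, unsigned inputs, unsigned width_bits)
{
    assert(inputs >= 1);
    return (unsigned long)outputs * (inputs - 1) * width_bits;
}
```

Under this model, `mux2_bits(32, 32, 8)` (a 256-bit byte shuffle) comes out at 7936 cells versus 1792 for `mux2_bits(8, 8, 32)` (a vpermd-style dword shuffle), so roughly 4.4x. Take the exact ratio with a large grain of salt; the point is only that byte granularity across the whole vector scales badly.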
I'd love to see a less hand-wavy answer from someone with hardware design experience. I built a digital timer from TTL counter chips on a breadboard once (and IIRC, read out the counter from BASIC on a TI-99/4A which was very obsolete even ~20 years ago whe), but that's about it.
It's pretty clear that the SSE PSHUFB instruction is pretty much among the most useful instructions of all time.
Yup. It was the first variable-shuffle, with a control mask from a register instead of an immediate. Looking up a shuffle mask from a LUT of shuffle masks based on a pcmpeqb / pmovmskb result can do some crazy powerful things. @stgatilov's IPv4 dotted-quad -> int converter is one of my favourite examples of awesome SIMD tricks.
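As a sketch of the compare / movemask / shuffle pattern (not @stgatilov's actual code), here is a portable scalar model in C that strips a given byte out of a 16-byte block by left-packing everything else. The `*_model` functions mimic my understanding of the pcmpeqb, pmovmskb, and pshufb semantics; `make_left_pack_ctrl` and `strip_byte16` are hypothetical helpers. Real code would index a precomputed LUT of shuffle masks with the pmovmskb result (usually on smaller chunks to keep the table manageable) rather than build the control vector on the fly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of pcmpeqb: 0xFF in each lane where the bytes match, else 0x00. */
static void pcmpeqb_model(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (a[i] == b[i]) ? 0xFF : 0x00;
}

/* Model of pmovmskb: collect the top bit of each byte into a 16-bit mask. */
static unsigned pmovmskb_model(const uint8_t v[16])
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (unsigned)(v[i] >> 7) << i;
    return mask;
}

/* Model of pshufb: a control byte with bit 7 set zeroes the lane,
 * otherwise its low 4 bits select a source byte. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t ctrl[16])
{
    uint8_t out[16];
    for (int i = 0; i < 16; i++)
        out[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
    memcpy(dst, out, sizeof out);
}

/* Hypothetical stand-in for one LUT entry: the shuffle control that
 * left-packs every lane whose bit in drop_mask is clear. */
static void make_left_pack_ctrl(uint8_t ctrl[16], unsigned drop_mask)
{
    int n = 0;
    for (int i = 0; i < 16; i++)
        if (!((drop_mask >> i) & 1))
            ctrl[n++] = (uint8_t)i;
    while (n < 16)
        ctrl[n++] = 0x80;                  /* zero the unused tail lanes */
}

/* Remove every occurrence of ch from a 16-byte block.
 * Returns the number of bytes kept; the tail of dst is zeroed. */
static int strip_byte16(uint8_t dst[16], const uint8_t src[16], uint8_t ch)
{
    uint8_t needle[16], eq[16], ctrl[16];
    memset(needle, ch, sizeof needle);     /* broadcast the byte to all lanes */
    pcmpeqb_model(eq, src, needle);
    unsigned mask = pmovmskb_model(eq);    /* one bit per matching lane */
    assert(mask <= 0xFFFFu);
    make_left_pack_ctrl(ctrl, mask);       /* real code: ctrl = lut[mask] */
    pshufb_model(dst, src, ctrl);
    int kept = 16;
    for (int i = 0; i < 16; i++)
        kept -= (int)((mask >> i) & 1);
    return kept;
}
```

Calling `strip_byte16` on a zero-padded block holding "192.168.100.1" with `ch = '.'` leaves the digits "1921681001" packed at the front. The data-dependent part is entirely in the control vector, which is why a variable shuffle plus a mask-indexed table is such a flexible building block.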