Question
- They're one of the simplest "instructions" you could perform on a computer (they're the first ones I'd personally implement)
- Performing NOT(AND(x, y)) doubles execution time AND dependency chain length AND code size
- BMI1 introduced "andnot" (andn), a meaningful addition because it's a unique operation - so why not the ones in the title of this question?
- You usually read answers along the lines of "they take up valuable opcode space", but then I look at all of the kmask operations introduced with AVX512, which, btw, include NAND and XNOR...
- Optimizing compilers could generate better code
- It gets way worse with SIMD: there is no NOT instruction, which requires tripling of execution time, dependency chain length (EDIT: not true; thanks @Peter Cordes) and code size instead of doubling:
vpcmpeqd xmm15, xmm15, xmm15   ; generate all-ones
vpor     xmm0, xmm0, xmm1      ; x | y
vpandn   xmm0, xmm0, xmm15     ; ~(x | y)
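For comparison, the scalar NAND from the second point above, as a minimal sketch with x and y assumed in eax and ebx:

and eax, ebx   ; eax = x & y
not eax        ; eax = ~(x & y): two dependent instructions where one nand would do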
Answer 1:
Those instructions wouldn't be as valuable as you imagine, and once a base ISA has been created, architects typically don't add new instructions unless there's a big win for some important use-case. (e.g. MMX isn't a big win overall for most code, but was a huge speedup for video/audio codecs as one of the early use-cases.)
Remember, most code isn't doing branchless bithacks. That only became much more common with SIMD, decades after 8086. I doubt most programmers would rather have nor than or.
(8086 had no space left for more standard ALU instruction encodings that follow its normal patterns¹.) A lot of code spends a lot of its time comparing-and-branching, looping over data structures (and stalling for memory), or doing "normal" math. Certainly bit-manipulation code exists, but a lot of code doesn't involve much of that.
Saving an instruction or two all over the place will help, but only if you can compile your whole application with these new instructions. (Although most of BMI1 and BMI2 are actually like that, e.g. SHLX/SHRX for 1-uop copy-and-shift-by-variable, but Intel still added them to patch over the really crappy 3-uop shift-by-cl.) That's fine if you're targeting a specific server (so you can build with -march=native), but a lot of x86 code is compiled ahead of time for use on random consumer machines. Extensions like SSE can greatly speed up single loops, so it's usually viable to dispatch to different versions of a single function to take advantage, while keeping the baseline requirement low.
But it wouldn't work that way for newly-added versions of the instructions you're suggesting, so the benefit of adding them is significantly lower. And they weren't already present because 8086 is super cramped.
But most ISAs don't have these: not ARM, not even PowerPC, which chooses to use the coding space in its 32-bit instruction words to have a lot of opcodes. (Including neat stuff like rlwinm, rotate and mask with a bit-range, and other bitfield insert/extract-to-arbitrary-position stuff.) So it's not just a matter of 8086 legacy screwing x86-64 yet again; it's that most CPU architects haven't considered it worth adding opcodes for these, even in a RISC with lots of space.
Although MIPS does have a nor, instead of a not. (MIPS xori zero-extends the immediate, so it couldn't be used to NOT a full register.)
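As a minimal sketch (MIPS assembly; register choices assumed), the usual not pseudo-instruction is just nor against the always-zero register:

nor $t0, $t1, $zero   # $t0 = ~($t1 | 0) = ~$t1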
SIMD code:
Note that once you've created an all-ones vector once, you can reuse it in a loop. Most SIMD code is in loops, although careful use of SIMD for a single struct can be good.
SIMD NOT only adds 1 cycle to the critical path, for a total of 2 cycles of latency for your NOR implementation. In your example, pcmpeqd is off the critical path and has no dependency on the old value of the register on almost all CPUs. (It still needs a SIMD execution unit to write the ones, though.) It costs throughput but not latency. Execution time might depend on either throughput or latency, for a given block of code. (How many CPU cycles are needed for each assembly instruction? (it's not that simple) / What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)
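A minimal sketch of that loop structure, assuming rsi/rdx point at two input arrays, rdi at the output, and rcx counts 16-byte chunks; the all-ones constant is created once and reused every iteration:

    vpcmpeqd xmm15, xmm15, xmm15   ; all-ones, hoisted out of the loop
.loop:
    vmovdqu  xmm0, [rsi]           ; x
    vpor     xmm0, xmm0, [rdx]     ; x | y
    vpandn   xmm0, xmm0, xmm15     ; ~(x | y): 2-cycle-latency NOR
    vmovdqu  [rdi], xmm0
    add      rsi, 16
    add      rdx, 16
    add      rdi, 16
    dec      rcx
    jnz      .loop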
BTW, compilers often use vpxor with all-ones instead of vpandn; the only advantage is with a memory source operand, where you can NOT-and-load with xor, unlike vpandn, where the optionally-memory operand (src2) is the one that's not inverted: dst = ~src1 & src2.
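A sketch of both ways to NOT a register, given xmm15 = all-ones as above:

vpxor  xmm0, xmm1, xmm15   ; ~x; the non-constant source could be memory: vpxor xmm0, xmm15, [mem]
vpandn xmm0, xmm1, xmm15   ; ~src1 & src2 = ~x & -1 = ~x; the memory-capable slot (src2) holds the constant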
Scalar code
You can often arrange your code to not need an inversion, e.g. checking the opposite FLAG condition after an OR. Not always; of course, when you're doing a chain of bitwise things it can come up, probably more so with SIMD.
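For example (a minimal sketch; registers and the label are assumed), to branch when !(x | y) there's no need to materialize the NOR first:

or  eax, ebx    ; ZF = ((x | y) == 0)
jz  both_zero   ; use the opposite condition of jnz instead of NOT-ing first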
The real speedup from adding more such instructions to BMI1 or a future extension probably would (have been) quite small for most general workloads like SPECint.
More valuable than integer xnor etc. would probably be non-destructive VEX versions of common integer instructions like sub that can't be done with LEA, so lots of mov/sub sequences could become a single vsub. Also maybe imul, or, maybe and, and perhaps shl/shr/sar-immediate. But sure, if you're adding stuff, might as well have nand, nor, and xnor. And maybe scalar abs, and setcc r/m32 to avoid the stupid xor-zeroing or movzx you need to booleanize into a 32-bit integer. (While you're at it, mov r/m32, sign_extended_imm8 would also be good for code density if you could find a one-byte opcode for it, e.g. one of the ones that 64-bit mode freed up.)
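For instance, the booleanize pattern that a setcc r/m32 would shorten (a sketch; register choices assumed):

xor  eax, eax   ; zero the full register first; setcc writes only 8 bits
cmp  edi, esi
setl al         ; eax = (edi < esi) ? 1 : 0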
There's a whole laundry list of bad or short-sighted design decisions it would be nice to reverse (or that it would have been nice if AVX had fixed), e.g. that cvtsi2sd xmm0, eax merges into XMM0, so it has a false dependency, leading GCC to spend an extra instruction on xor-zeroing the destination. AVX was a chance to change that behaviour for the VEX version, and maybe could have handled it internally by giving the existing execution unit the physical zero-reg as the merge target. (Which exists in the physical register file on SnB-family; that's why xor-zeroing can be fully eliminated in rename, like mov-elimination.) But nope, Intel kept everything as much like the legacy-SSE versions as they possibly could, preserving that short-sighted Pentium III design decision. :( (PIII split xmm regs into two 64-bit halves: only writing the low half was good for it for SSE1 cvtsi2ss. Intel continued the merging for SSE2 cvtsi2sd in P4, for consistency I guess.)
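That's why compilers emit a dependency-breaking zero first; a sketch of the workaround mentioned above:

pxor     xmm0, xmm0   ; break the false dependency by zeroing the destination
cvtsi2sd xmm0, eax    ; without the pxor, this merges into xmm0's stale upper half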
It might have made sense to add negated-boolean instructions in some SIMD version before AVX-512, like SSE4.1 (which added a bunch of miscellaneous integer stuff and made things more orthogonal, and only arrived in 45nm Core 2, so transistor budgets were a lot higher than in MMX or SSE1/2 days), or AVX (which opened up a lot of coding space with VEX).
But since they didn't, there's little point adding them now that vpternlogd exists. Unless Intel is going to create new legacy-SSE or 256-bit-only VEX extensions that AMD might want to implement...
(Legacy-SSE would make it usable even in their Silvermont-family CPUs, and in Pentium/Celeron CPUs, none of which decode VEX prefixes. That's why unfortunately even Skylake Pentiums disable BMI1/2 support along with AVX1/2/FMA. This is really dumb and means we're no closer to being able to use BMI1/2 as a baseline for ahead-of-time compiled stuff that should run on "modern desktops".)
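For reference, vpternlogd already does NOR (or any other 3-input boolean) in one instruction: the imm8 is the truth table, and 0x03 has bits set only where a = 0 and b = 0:

vpternlogd zmm0, zmm1, zmm1, 0x03   ; zmm0 = ~(zmm0 | zmm1); the third operand is a don't-care here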
Opcode coding space
VEX has lots of coding space, and mask instructions use that. Also, AVX-512 is only implemented by high-end CPUs; it will be a long time if ever before Intel's low-power Silvermont family CPUs implement it. So needing to decode all those different VEX-coded mask instructions is something AVX-512 CPUs just have to deal with.
AVX-512 (or a predecessor) was originally designed for Larrabee, a GPU project which turned into Xeon Phi compute cards. So AVX-512 ISA-design choices don't fully reflect what you might design with general-purpose usage in mind. Although having lots of relatively small cores would mean you'd want to avoid anything that inflated decoder die-area or power too much, so it's not unreasonable.
But without VEX, x86 opcode space is very crowded (literally no 1-byte opcodes left in 32-bit mode, and few 0f xx left: http://ref.x86asm.net/coder32.html). Intel (unlike AMD) still for some reason likes to make some CPUs that can't decode VEX prefixes. Of course they could change that and add VEX decoding to Silvermont so they could have VEX-coded integer instructions without supporting AVX (or all of BMI2). (BMI2 includes pext/pdep, which are expensive to implement fast in a dedicated execution unit. AMD chooses to micro-code them, so they're very slow, but that lets code use other BMI2 instructions usefully.)
(Unfortunately there's no way for a CPU to advertise (via CPUID) that it supports only 128-bit-vector-width AVX instructions, which would have allowed narrower CPUs to still get non-destructive instructions. OTOH, without some forward-compatible way for code to use wider instructions on CPUs that do support them, making 128-bit AVX code to optimize for current CPUs might end up being called "good enough", with nobody bothering to make 256-bit versions for CPUs that can support it.)
Footnote 1: opcodes for original-8086 instructions
Just getting every different opcode decoded was a challenge for 8086, and each ALU instruction has about 8 different opcodes: memory destination, memory source, immediate source, and special-case no-ModRM AL/AX forms. And times two for the 8- and 16-bit versions of each of those. Plus xnor r/m16, sign_extended_imm8. Of course the immediate forms can use the /r field in ModRM as extra opcode bits, but xnor r/m8, r and xnor r, r/m8 and the 16-bit forms would need 4 separate opcode bytes, and so would xnor al, imm8 and xnor ax, imm16, so that's 6 whole opcode bytes per instruction, plus a /constant slot in some overloaded opcode.
(semi-related: https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code/160739#160739 re: short-form AL,imm8 encodings.)
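To make the counting concrete, here's the full encoding family for one original-8086 ALU instruction, add; the /0 in the immediate forms is the overloaded-opcode "/constant" trick mentioned above:

00 /r     add r/m8,  r8     ; memory/register destination
01 /r     add r/m16, r16
02 /r     add r8,  r/m8     ; register destination
03 /r     add r16, r/m16
04 ib     add al, imm8      ; short accumulator forms, no ModRM
05 iw     add ax, imm16
80 /0 ib  add r/m8,  imm8   ; 80/81/83 are shared by 8 instructions,
81 /0 iw  add r/m16, imm16  ; with ModRM's /r field selecting which one
83 /0 ib  add r/m16, sign_extended_imm8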
Part of the pattern you can see in the original-8086 opcodes is that one bit selects between r/m destination vs. r/m source, and another bit between 8- and 16-bit operand-size (Is there a pattern to x86 op codes? (other than direction and size bits) / Are x86 opcodes arbitrary?). So doing it differently for a few rarer instructions (by leaving out memory-destination or 8-bit forms, for example) might have broken the pattern, and if so would have needed extra transistors beyond the standard patterns for feeding the ALU after a load or register fetch, or for load/ALU/store.
In fact, I don't think 8086 left enough room for even one more ALU instruction that supported all the standard forms like add or or. And 8086 didn't decode any 0f xx opcodes; that came later for extensions.
Source: https://stackoverflow.com/questions/65596423/why-are-there-no-nand-nor-and-xnor-instructions-in-x86