So, since this topic seems to interest you, let me give you an overview. An x86 instruction comprises up to five parts and is up to 15 bytes long:
prefixes opcode operand displacement immediate
It is possible to generate encodings that are longer than 15 bytes, but the CPU rejects them. All five parts except for the opcode are optional. You can find their length as follows:
- An instruction can have any number of legacy prefixes. These are: `f0` lock, `f2` repne, `f3` repe, `2e` cs, `36` ss, `3e` ds, `26` es, `64` fs, `65` gs, `66` operand size override, and `67` address size override. However, only one of `f0`, `f2`, `f3` and only one of `26`, `2e`, `36`, `3e`, `64`, and `65` is recognized at a time. If more than one prefix from the same group is provided, behaviour differs between CPUs. VEX and EVEX encoded instructions may only have the segment override and address size override legacy prefixes, as the other prefixes are subsumed under the VEX and EVEX prefixes.
- In long mode (and only there), an instruction may have a REX prefix immediately after all legacy prefixes. The REX prefix is one of `40` to `4f`. In other modes, these bytes are instructions (`inc` and `dec`), not prefixes, and your decoder must account for that. As with most legacy prefixes, a VEX or EVEX encoded instruction cannot have a REX prefix.
- The bytes `c4` and `c5` can introduce a VEX prefix used to encode some modern instructions. In long mode, they always do, but in other modes, you have to check the byte afterwards: interpret it as a modr/m byte; if it encodes an `r,r` operand pair, it's a VEX prefix, otherwise it's the opcode for `les` or `lds`. A VEX prefix beginning with `c5` is two bytes long, with `c4` it's three bytes. The VEX prefix also encodes the `0f`, `0f 38` and `0f 3a` opcode prefixes, which are omitted in a VEX encoded instruction. Note that generally, using a VEX prefix is not optional. For example, `pdep` is encoded as `VEX.NDS.LZ.F2.0F38.W0 F5 /r` (e.g. `c4 e2 7b f5 c0` for `pdep eax,eax,eax`) but the corresponding legacy instruction `f2 0f 38 f5 r/m32` (e.g. `f2 0f 38 f5 c0` for `pdep eax,eax`) is invalid. Note that the same opcode can exist with and without a VEX prefix, and the two can mean different things. For example, `0f 77` is `emms` but `VEX.128.0F.WIG 77` (i.e. `c5 f8 77`) is `vzeroupper`.
- The byte `62` introduces an EVEX prefix, which is used to encode AVX-512 instructions. Similar to the VEX prefix, the next few bytes need to be checked to distinguish an EVEX prefix from the `bound` instruction. The EVEX prefix is always four bytes long and encodes part of the opcode just as the VEX prefix does.
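To make the prefix rules above a bit more concrete, here is a minimal Python sketch of a prefix scanner. The names `scan_prefixes` and `LEGACY_PREFIXES` are made up for this illustration. It assumes that REX, if present, directly follows the legacy prefixes, and that outside long mode `c4`, `c5` and `62` count as prefixes exactly when the top two bits of the following byte are `11`. It does not validate prefix combinations, so treat it as a rough outline rather than a complete decoder.

```python
# Legacy prefix bytes and the names used in the list above.
LEGACY_PREFIXES = {
    0xF0: "lock", 0xF2: "repne", 0xF3: "repe",
    0x2E: "cs", 0x36: "ss", 0x3E: "ds", 0x26: "es", 0x64: "fs", 0x65: "gs",
    0x66: "opsize", 0x67: "addrsize",
}

# First byte of a VEX/EVEX prefix -> (kind, total prefix length in bytes).
VECTOR_PREFIXES = {0xC5: ("vex2", 2), 0xC4: ("vex3", 3), 0x62: ("evex", 4)}


def scan_prefixes(code: bytes, long_mode: bool = True) -> dict:
    """Collect the prefixes at the start of `code` and locate the opcode."""
    pos = 0
    legacy = []
    # Any number of legacy prefixes may come first.
    while pos < len(code) and code[pos] in LEGACY_PREFIXES:
        legacy.append(LEGACY_PREFIXES[code[pos]])
        pos += 1

    rex = None
    vector = None
    if pos < len(code):
        b = code[pos]
        # Assumption: outside long mode, the next byte read as modr/m must
        # have mod == 11 (register operand) for c4/c5/62 to be a prefix;
        # otherwise the byte is the opcode of les, lds or bound.
        next_is_reg = pos + 1 < len(code) and (code[pos + 1] & 0xC0) == 0xC0
        if long_mode and 0x40 <= b <= 0x4F:
            rex = b  # REX exists only in long mode
            pos += 1
        elif b in VECTOR_PREFIXES and (long_mode or next_is_reg):
            vector, size = VECTOR_PREFIXES[b]
            pos += size

    return {"legacy": legacy, "rex": rex, "vector": vector, "opcode_offset": pos}
```

For the `pdep` example, `scan_prefixes(bytes.fromhex("c4e27bf5c0"))` reports a three-byte VEX prefix with the opcode at offset 3, while `scan_prefixes(bytes.fromhex("c400"), long_mode=False)` reports no prefix at all because `c4 00` is `les` there.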
After the prefixes, the opcode follows. Originally, the opcode was always a single byte, but then they ran out of space, so now it's either a single byte or a single byte prefixed by `0f`, `0f 38`, or `0f 3a`. These prefixes are absent if the instruction is VEX encoded. Note that some prefixes may change what instruction is encoded. For example, opcode `0f b8` is `jmpe` (Enter IA-64 mode) but `f3 0f b8` is not `repe jmpe` but rather `popcnt`.
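As a rough illustration of the opcode step for non-VEX instructions, the following Python sketch picks the opcode map from the `0f`, `0f 38` and `0f 3a` escape bytes and then looks up a mnemonic in a tiny table that holds only the examples from this answer. The helpers `opcode_map` and `lookup`, and the `MNEMONICS` table, are hypothetical names invented here; a real decoder needs the full opcode tables.

```python
def opcode_map(code: bytes, offset: int = 0):
    """Return (map name, opcode byte, offset after the opcode)."""
    b = code[offset]
    if b != 0x0F:
        return ("one-byte", b, offset + 1)
    b2 = code[offset + 1]
    if b2 in (0x38, 0x3A):
        # Three-byte opcodes: 0f 38 xx and 0f 3a xx
        return ("0f %02x" % b2, code[offset + 2], offset + 3)
    return ("0f", b2, offset + 2)


# Illustrative table only: (map, opcode, prefix that changes the meaning).
MNEMONICS = {
    ("0f", 0xB8, None): "jmpe",
    ("0f", 0xB8, 0xF3): "popcnt",
    ("0f", 0x77, None): "emms",
}


def lookup(code: bytes):
    """Look up a mnemonic, treating a leading 66/f2/f3 as part of the key."""
    prefix = None
    offset = 0
    if code[0] in (0x66, 0xF2, 0xF3):
        prefix = code[0]
        offset = 1
    name, opcode, _ = opcode_map(code, offset)
    return MNEMONICS.get((name, opcode, prefix))
```

With this table, `lookup(bytes.fromhex("0fb8"))` gives `"jmpe"` and `lookup(bytes.fromhex("f30fb8c0"))` gives `"popcnt"`, which mirrors how the `f3` byte selects a different instruction instead of acting as `repe`.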
The opcode and the prefixes decide which instruction is encoded. From here on, it's mostly smooth sailing. Depending on the instruction, a modr/m byte may follow. Depending on the modr/m byte and the address size override prefix, a sib byte and one, two, or four displacement bytes may follow. Finally, depending on the instruction, the operand size override prefix, and the REX prefix, one, two, four, six, or eight immediate bytes may follow.
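The modr/m part can be sketched in a few lines of Python. The function name `modrm_extra_bytes` and its `addr16` flag are assumptions made for this illustration; `addr16` stands for "the effective address size is 16 bits", which in practice you would work out from the CPU mode and the `67` prefix. The function returns how many sib and displacement bytes follow the modr/m byte under the standard modr/m rules; immediates are not covered.

```python
def modrm_extra_bytes(code: bytes, offset: int, addr16: bool = False) -> int:
    """Count the sib and displacement bytes after the modr/m byte at `offset`."""
    modrm = code[offset]
    mod = modrm >> 6   # top two bits
    rm = modrm & 7     # bottom three bits
    if mod == 3:
        return 0       # register operand: no sib, no displacement

    if addr16:
        # 16-bit addressing has no sib byte.
        if mod == 0:
            return 2 if rm == 6 else 0   # rm == 110 means a bare disp16
        return 1 if mod == 1 else 2      # disp8 or disp16

    extra = 0
    no_base = rm == 5                    # mod == 00, rm == 101: bare disp32
    if rm == 4:
        extra += 1                       # a sib byte follows
        no_base = (code[offset + 1] & 7) == 5   # sib base == 101
    if mod == 0:
        extra += 4 if no_base else 0
    elif mod == 1:
        extra += 1                       # disp8
    else:
        extra += 4                       # disp32
    return extra
```

For example, the modr/m byte `c0` from the `pdep` encoding above needs no extra bytes, while `44 24 08` (as in `[esp+8]`) needs two: one sib byte and one displacement byte.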
That's about as much of a description as I can give in the scope of a Stack Overflow answer. So TL;DR: It's really complicated.