This relates to this question
Thinking about it though, on a modern intel CPU the SEC phase is implemented in microcode meaning there would be a check whereby a burn
Intel has patented some very assembly-like functionality for microcode, which includes:
Execution from L1, L2 or L3(!!!!!!!!!!!!!!!!!!!!!!!). Heck, they patented loading a "big" microcode update from mass storage into L3 and then updating from there... -- note that "patented" and "implemented" are distinct, I have no idea if they have currently implemented anything else than execution from L1.
Opcode and Ucode(!) sections in the MCU package (unified microprocessor update) -- the thing we call "microcode update" but really has/can have all sort of stuff inside, including PMU firmware updates, MCROM patches, uncore parameter changes, PWC firmware, etc, that get executed before/after the processor firmware/ucode update procedure.
Subroutine-like behavior including parameters on the Ucode. Conditional branching, or at least conditional loops, they've had for quite a while.
Compression and decompression of the microcode (unknown if it can be "run" from compressed state directly, but the patent seems to imply it would at least be used to optimize the MCU package).
And WRMSR/RDMSR really are more like an RPC into Ucode than anything else nowadays, which I suppose got really helpful when they found out they needed a new MSR, or had to make a complex change to an architectural MSR's behavior (like the LAPIC base register, which had to be "gatekeeped" to work around the LAPIC memory sinkhole SMM security hole that made the news a few years ago).
So, just look at it as a hardware-accelerated turing-complete RISC machine that implements the "public" instruction architecture.
Microcode branches are apparently special.
Intel's P6 and SnB families do not support dynamic prediction for microcode branches, according to Andy Glew's description of original P6 (What setup does REP do?). Given the similar performance of SnB-family rep-string instructions, I assume this PPro fact applies to even the most recent Skylake / CoffeeLake CPUs (see footnote 1).
But there is a penalty for microcode branch misprediction, so they are statically(?) predicted. (This is why rep movsb startup cost goes in increments of 5 cycles for low/medium/high counts in ECX, and aligned vs. misaligned.)
A microcoded instruction takes a full line to itself in the uop cache. When it reaches the front of the IDQ, it takes over the issue/rename stage until it's done issuing microcode uops. (See also How are microcodes executed during an instruction cycle? for more detail, and some evidence from perf event descriptions like idq.dsb_uops that show the IDQ can be accepting new uops from the uop cache while the issue/rename stage is reading from the microcode sequencer.)
For rep-string instructions, I think each iteration of the loop has to actually issue through the front-end, not just loop inside the back-end and reuse those uops. So this involves feedback from the OoO back-end to find out when the instruction is finished executing.
I don't know the details of what happens when issue/rename switches over to reading uops from the MS-ROM instead of the IDQ.
Even though each uop doesn't have its own RIP (being part of a single microcoded instruction), I'd guess that the branch mispredict detection mechanism works similarly to normal branches.
rep movs setup times on some CPUs seem to go in steps of 5 cycles depending on which case it is (small vs. large, alignment, etc). If these steps come from microcode branch mispredicts, that would appear to mean that the mispredict penalty is a fixed number of cycles, unless that's just a special case of rep movs. Maybe that's because the OoO back-end can keep up with the front-end? And reading from the MS-ROM shortens the path even more than reading from the uop cache, making the miss penalty that low.
It would be interesting to run some experiments into how much OoO exec is possible around rep movsb, e.g. with two chains of dependent imul instructions, to see if it (partially) serializes them like lfence. We hope not, but to achieve ILP the later imul uops would have to issue without waiting for the back-end to drain.
I did some experiments here on Skylake (i7-6700k). Preliminary result: copy sizes of 95 bytes or less are cheap; their cost is hidden by the latency of the IMUL chains, which basically fully overlap. Copy sizes of 96 bytes or more drain the RS, serializing the two IMUL chains. It doesn't matter whether it's rep movsb with RCX=95 vs. 96 or rep movsd with RCX=23 vs. 24. See discussion in comments for some more summary of my findings; if I find time I'll post more details.
The "drains the RS" behaviour was measured with the rs_events.empty_end:u event becoming 1 per rep movsb instead of ~0.003. other_assists.any:u was zero, so it's not an "assist", or at least not counted as one.
Perhaps whatever uop is involved only detects a mispredict when reaching retirement, if microcode branches don't support fast recovery via the BoB? The 96 byte threshold is probably the cutoff for some alternate strategy. RCX=0 also drains the RS, presumably because it's also a special case.
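To make the byte arithmetic behind those RCX values explicit, here is a minimal Python sketch. It only encodes the observations reported above (Skylake, a cutoff at 96 bytes, and RCX=0 as a special case); the function names and the `threshold` parameter are made up for illustration, and nothing here models how the hardware actually decides.

```python
# Element size in bytes for the two rep movs variants mentioned in the text.
ELEMENT_BYTES = {"movsb": 1, "movsd": 4}

def copy_bytes(variant, rcx):
    """Total bytes copied by `rep <variant>` with the given RCX count."""
    return ELEMENT_BYTES[variant] * rcx

def observed_rs_drain(variant, rcx, threshold=96):
    """Encode the reported observation only (hypothetical helper):
    RCX=0 and copies of `threshold` bytes or more drained the RS."""
    return rcx == 0 or copy_bytes(variant, rcx) >= threshold

for variant, rcx in [("movsb", 95), ("movsb", 96), ("movsd", 23), ("movsd", 24)]:
    print(variant, rcx, copy_bytes(variant, rcx), observed_rs_drain(variant, rcx))
```

Note that rep movsd with RCX=23 copies 92 bytes and RCX=24 copies 96, so both instruction forms are consistent with a cutoff expressed in bytes rather than in iteration count.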
Would be interesting to test with rep scas (which doesn't have fast-strings support, and is just slow and dumb microcode.)
Intel's 1994 Fast Strings patent describes the implementation in P6. It doesn't have an IDQ (so it makes sense that modern CPUs that do have buffers between stages and a uop cache will have some changes), but the mechanism they describe for avoiding branches is neat and maybe still used for modern ERMSB: the first n copy iterations are predicated uops for the back-end, so they can be issued unconditionally. There's also a uop that causes the back-end to send its ECX value to the microcode sequencer, which uses that to feed in exactly the right number of extra copy iterations after that. Just the copy uops (and maybe updates of ESI, EDI, and ECX, or maybe only doing that on an interrupt or exception), not microcode-branch uops.
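As a rough way to picture that scheme, the following Python sketch counts uops under my reading of the description above. It is a toy model, not the patent's logic: the function name is hypothetical, and `n` is a free parameter because the text doesn't say how many predicated iterations are issued up front.

```python
def fast_strings_issue(ecx, n):
    """Toy model (assumption, not the real microcode).

    The first `n` copy uops are predicated and issued unconditionally.
    Those beyond the requested count do nothing. After the back-end
    reports ECX to the microcode sequencer, it feeds in exactly the
    remaining iterations, with no microcode branch involved.
    """
    executed = min(ecx, n)      # predicated uops that actually copy
    nullified = n - executed    # predicated uops whose predicate is false
    extra = max(ecx - n, 0)     # iterations fed in after reading ECX
    return executed, nullified, extra

# Short copy: everything fits in the unconditional batch.
print(fast_strings_issue(ecx=3, n=8))
# Longer copy: the sequencer adds the remaining iterations.
print(fast_strings_issue(ecx=20, n=8))
```

In this model the number of copy iterations that do real work always equals ECX, whichever side of `n` the count falls on, which is the property that lets the scheme avoid a data-dependent branch.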
This initial n uops vs. feeding in more after reading RCX could be the 96-byte threshold I was seeing; it came with an extra idq.ms_switches:u per rep movsb (up from 4 to 5).
https://eprint.iacr.org/2016/086.pdf suggests that microcode can trigger an assist in some cases, which might be the modern mechanism for larger copy sizes and would explain draining the RS (and apparently ROB), because it only triggers when the uop is committed (retired), so it's like a branch without fast-recovery.
"The execution units can issue an assist or signal a fault by associating an event code with the result of a micro-op. When the micro-op is committed (§ 2.10), the event code causes the out-of-order scheduler to squash all the micro-ops that are in-flight in the ROB. The event code is forwarded to the microcode sequencer, which reads the micro-ops in the corresponding event handler"
The difference between this and the P6 patent is that this assist-request can happen after some non-microcode uops from later instructions have already been issued, in anticipation of the microcoded instruction being complete with only the first batch of uops. Or if it's not the last uop in a batch from microcode, it could be used like a branch for picking a different strategy.
But that's why it has to flush the ROB.
My impression of the P6 patent is that the feedback to the MS happens before issuing uops from later instructions, in time for more MS uops to be issued if needed. If I'm wrong, then maybe it's already the same mechanism still described in the 2016 paper.
Usually, when a branch mispredicts as being taken then when the instruction retires,
Intel since Nehalem has had "fast recovery", starting recovery when a mispredicted branch executes, not waiting for it to reach retirement like an exception.
This is the point of having a Branch-Order-Buffer on top of the usual ROB retirement state that lets you roll back when any other type of unexpected event becomes non-speculative. (What exactly happens when a skylake CPU mispredicts a branch?)
Footnote 1: IceLake is supposed to have the "fast short rep" feature, which might be a different mechanism for handling rep strings, rather than a change to microcode. e.g. maybe a HW state machine like Andy mentions he wished he'd designed in the first place.
I don't have any info on performance characteristics, but once we know something we might be able to make some guesses about the new implementation.