Question
After reading this: "When an interrupt occurs, what happens to instructions in the pipeline?"
There is not much information on what happens to software interrupts but we do learn the following:
Conversely, exceptions, things like page faults, mark the instruction affected. When that instruction is about to commit, at that point all later instructions after the exception are flushed, and instruction fetch is redirected.
I was wondering what happens to software interrupts (INT 0xX) in the pipeline. Firstly, when are they detected? Perhaps at the predecode stage? In the instruction queue? At the decode stage? Or do they reach the backend and immediately complete (without entering the reservation station), retire in order, and the retirement stage then notices that it is an INT instruction (which seems wasteful)?
Let's say it is picked up at predecode: there must be a method of signalling the IFU to stop fetching instructions, or indeed clock/power gating it; or, if it's picked up in the instruction queue, a way of flushing instructions before it in the queue. There must then be a way of signalling some sort of logic ('control unit') to generate the uops for the software interrupt (indexing into the IDT, checking DPL >= CPL >= segment RPL, etc.). That's a naive suggestion, but if anyone knows this process any better, great.
I also wonder how it copes when this process is disturbed, i.e. a hardware interrupt occurs (bearing in mind traps don't clear IF in EFLAGS) and the CPU now has to begin a whole new process of interrupt handling and uop generation; how would it get back to its state of handling the software interrupt afterwards?
Answer 1:
That quote from Andy @Krazy Glew is about synchronous exceptions discovered during execution of a "normal" instruction, like mov eax, [rdi] raising #PF if it turns out that RDI is pointing to an unmapped page [1]. You expect that not to fault, so you defer doing anything until retirement, in case it was in the shadow of a branch mispredict or an earlier exception.
But yes, his answer doesn't go into detail about how the pipeline optimizes for synchronous int trap instructions that we know upon decode will always cause an exception. Trap instructions are also pretty rare in the overall instruction mix, so optimizing for them doesn't save you a lot of power; it's only worth doing the things that are easy.
As Andy says, current CPUs don't rename the privilege level and thus can't speculate into an interrupt/exception handler, so stalling fetch/decode after seeing an int or syscall is definitely a sensible thing. I'm just going to write int or "trap instruction", but the same goes for syscall/sysenter/sysret/iret and other privilege-changing "branch" instructions, and for the 1-byte versions of int like int3 (0xcc) and int1 (0xf1). The conditional trap-on-overflow into is interesting; for non-horrible performance in the no-trap case it's probably assumed not to trap. (And of course there are vmcall and related VMX-extension instructions, and probably SGX EENTER, and probably other stuff. But as far as stalling the pipeline is concerned, I'd guess all trap instructions are treated the same, except for the conditional into.)
I'd assume that, like lfence, the CPU doesn't speculate past a trap instruction. You're right, there'd be no point in having those uops in the pipeline, because anything after an int is definitely getting flushed.
IDK if anything would fetch from the IVT (real-mode interrupt vector table) or IDT (interrupt descriptor table) to get the address of an int handler before the int instruction becomes non-speculative in the back-end. Possibly. (Some trap instructions, like syscall, use an MSR to set the handler address, so starting code fetch from there would possibly be useful, especially if it triggers an L1i miss early. This has to be weighed against the possibility of seeing int and other trap instructions on the wrong path, after a branch miss.)
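As an aside: in 64-bit mode that handler-address MSR is IA32_LSTAR, so no table walk at all is needed to find the syscall entry point. Below is a minimal sketch of how an OS publishes it, assuming kernel-mode / bare-metal code (WRMSR is privileged) and a hypothetical syscall_entry stub; none of these names come from the answer above.

```c
#include <stdint.h>

#define IA32_LSTAR 0xC0000082u          /* MSR holding the 64-bit SYSCALL target RIP */

/* Hypothetical kernel entry stub; in a real kernel this would be assembly. */
void syscall_entry(void) { }

/* WRMSR takes the MSR index in ECX and the 64-bit value split across EDX:EAX. */
static inline void wrmsr(uint32_t msr, uint64_t value)
{
    uint32_t lo = (uint32_t)value, hi = (uint32_t)(value >> 32);
    __asm__ volatile("wrmsr" :: "c"(msr), "a"(lo), "d"(hi));
}

void install_syscall_entry(void)
{
    /* After this, the target of every SYSCALL is a fixed address held in an MSR,
       which is why the front-end could in principle start code fetch from it as
       soon as it sees a syscall instruction, with no IDT/IVT memory access. */
    wrmsr(IA32_LSTAR, (uint64_t)(uintptr_t)syscall_entry);
}
```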
Mis-speculation hitting a trap instruction is probably rare enough that it would be worth it to start loading from the IDT or prefetching the syscall entry point as soon as the front-end sees a trap instruction, if the front-end is smart enough to handle all this. But it probably isn't. Leaving the fancy stuff to microcode makes sense to limit the complexity of the front end. Traps are rare-ish, even in syscall-heavy workloads. Batching work to hand off in bigger chunks across the user/kernel barrier is a good thing, because a cheap syscall is very, very hard post-Spectre...
So at the latest, a trap would be detected in issue/rename (which already knows how to stall for (partially) serializing instructions), and no further uops would be allocated into the out-of-order back end until the int was retired and the exception was being taken.
But detecting it in decode seems likely, and not decoding further past an instruction that definitely takes an exception. (And where we don't know where to fetch next.) The decoder stage does know how to stall, e.g. for illegal-instruction traps.
Let's say it is picked up at predecode
That's probably not practical; you don't know it's an int until full decode. Pre-decode is just instruction-length finding on Intel CPUs. I'd assume that the opcodes for int and syscall are just two of many that have the same length.
Building HW into pre-decode to look deeper, searching for trap instructions, would cost more power than it's worth. (Remember, traps are very rare, and detecting them early mostly only saves power, so you can't spend more power looking for them than you save by stopping pre-decode after passing along a trap to the decoders.)
You need to decode the int so its microcode can execute and get the CPU started again running the interrupt handler, but yes, in theory you could have pre-decode stall in the cycle after passing it through.
It's in the regular decoders that, for example, jump instructions missed by branch prediction are identified, so it makes much more sense for the main decode stage to handle traps by not going any further.
Hyperthreading
You don't just power-gate the front-end when you discover a stall. You let the other logical thread have all the cycles.
Hyperthreading makes it less valuable for the front-end to start fetching from memory pointed to by the IDT without the back-end's help. If the other thread isn't stalled, and can benefit from the extra front-end bandwidth while this thread sorts out its trap, the CPU is doing useful work.
I certainly wouldn't rule out code-fetch from the SYSCALL entry-point, because that address is in an MSR, and it's one of the few traps that is performance-relevant in some workloads.
Another thing I'm curious about is how much impact, if any, one logical core switching privilege levels has on the performance of the other logical core. To test this, you'd construct some workload that bottlenecks on your choice of front-end issue bandwidth, a back-end port, back-end dep-chain latency, or the back-end's ability to find ILP over a medium to long distance (RS size or ROB size), or some combination. Then compare cycles/iteration for that test workload running on a core to itself, sharing a core with a tight dec/jnz thread, with a 4x pause / dec/jnz workload, and with a syscall workload that makes ENOSYS system calls under Linux. Maybe also an int 0x80 workload to compare different traps.
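A rough sketch of what such a test pair could look like (the sibling-core numbers, the trick of calling syscall(-1) to get a fast -ENOSYS return, and all names here are my own illustrative choices, not anything prescribed above):

```c
/* gcc -O2 -pthread smt_trap_test.c   (Linux-only sketch) */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

#define ITERS 100000000UL

static void pin_to_cpu(int cpu)              /* bind the calling thread to one logical core */
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *spin_worker(void *arg)          /* stand-in for the tight dec/jnz thread */
{
    pin_to_cpu(*(int *)arg);
    for (volatile unsigned long i = ITERS; i != 0; i--)
        ;                                    /* a real test would hand-write the dec/jnz loop */
    return NULL;
}

static void *trap_worker(void *arg)          /* trap-heavy thread: every iteration is a syscall */
{
    pin_to_cpu(*(int *)arg);
    for (unsigned long i = 0; i < ITERS / 100; i++)
        syscall(-1);                         /* invalid number: kernel returns -ENOSYS quickly */
    return NULL;
}

int main(void)
{
    /* cpu 0 and cpu 4 are assumed to be SMT siblings; check
       /sys/devices/system/cpu/cpu0/topology/thread_siblings_list on your machine. */
    int cpu_a = 0, cpu_b = 4;
    pthread_t a, b;
    pthread_create(&a, NULL, spin_worker, &cpu_a);
    pthread_create(&b, NULL, trap_worker, &cpu_b);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

Timing spin_worker (e.g. with perf stat on its core) with the sibling idle, with a second spin_worker, and with trap_worker would give a crude measure of how much a trap-heavy thread costs its sibling.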
Footnote 1: Exception handling, like #PF on a normal load.
(Off topic, re: innocent looking instructions that fault, not trap instructions that can be detected in the decoders as raising exceptions).
You wait until commit (retirement) because you don't want to start an expensive pipeline flush right away, only to discover that this instruction was in the shadow of a branch miss (or an earlier faulting instruction) and shouldn't have run (with that bad address) in the first place. Let the fast branch-recovery mechanism catch it.
This wait-until-retirement strategy (and a dangerous L1d cache that doesn't squash the load value to 0 for L1d hits where the TLB says the page is valid but the access has no read permission) is the key to why the Meltdown and L1TF exploits work on some Intel CPUs. (http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/). Understanding Meltdown is pretty helpful for understanding synchronous exception handling strategies in high-performance CPUs: marking the instruction and only doing anything if it reaches retirement is a good cheap strategy because exceptions are very rare.
It's apparently not worth the complexity to have execution units signal back to the front-end to stop fetch / decode / issue if any uop in the back end detects a pending #PF or other exception. (Presumably because that would more tightly couple parts of the CPU that are otherwise pretty far apart.)
Also, instructions from the wrong path might still be in flight during fast recovery from a branch miss, and making sure you only stop the front-end for expected faults on what we think is the currently correct path of execution would require more tracking. Any uop in the back-end was at one point thought to be on the correct path, but it might not be anymore by the time it gets to the end of an execution unit.
If you weren't doing fast recovery, then maybe it would be worth having the back-end send a "something is wrong" signal to stall the front-end until the back-end either actually takes an exception, or discovers the correct path.
With SMT (hyperthreading), this could leave more front-end bandwidth for other threads when a thread detected that it was currently speculating down a (possibly correct) path that leads to a fault.
So there is maybe some merit to this idea; I wonder if any CPUs do it?
Answer 2:
I agree with everything Peter said in his answer. While there can be many ways to implement the INTn instructions, the implementation would most probably be tuned for CPU design simplicity rather than performance. The earliest point at which it can be non-speculatively determined that such an instruction exists is at the end of the decode stage of the pipeline. It might be possible to predict whether the fetched bytes contain an instruction that may or does raise an exception, but I couldn't find a single research paper that studies this idea, so it doesn't seem to be worth it.
Execution of INTn involves fetching the specified entry from the IDT, performing many checks, calculating the address of the exception handler, and then telling the fetch unit to start prefetching from there. This process depends on the operating mode of the processor (real mode, 64-bit mode, etc.). The mode is described by multiple flags from the CR0, CR4, and EFLAGS registers. Therefore, it would take many uops to actually invoke an exception handler. In Skylake, there are 4 simple decoders and 1 complex decoder. The simple decoders can only emit a single fused uop. The complex decoder can emit up to 4 fused uops. None of them can handle INTn, so the MSROM needs to be engaged in order to execute the software interrupt. Note that the INTn instruction itself might cause an exception. At this point, it's unknown whether INTn will transfer control to the specified exception handler (whatever its address is) or to some other exception handler. All that is known for sure is that the instruction stream will definitely end at INTn and begin somewhere else.
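To give a feel for how much work those "many checks" are, here is a rough C model of the architectural checks a software INT n has to pass when going through a 32-bit protected-mode IDT gate (a conceptual sketch based on the documented gate-descriptor format and privilege rules, not a description of Intel's actual microcode; the function, struct, and fault representation are invented for illustration):

```c
#include <stdint.h>
#include <stdbool.h>

/* 32-bit protected-mode gate descriptor, 8 bytes. */
struct idt_gate {
    uint16_t offset_low;
    uint16_t selector;      /* target code-segment selector */
    uint8_t  reserved;
    uint8_t  type_attr;     /* P (bit 7), DPL (bits 6:5), gate type (bits 3:0) */
    uint16_t offset_high;
};

struct fault { int vector; uint32_t error_code; };

/* Conceptual model of the checks a software "INT n" must pass before control
   can transfer -- roughly the work the microcode routine has to do. */
static bool int_n_gate_checks(const struct idt_gate *idt, uint16_t idt_limit,
                              uint8_t vector, unsigned cpl,
                              uint32_t *handler_offset, uint16_t *handler_cs,
                              struct fault *f)
{
    uint32_t index = (uint32_t)vector * 8;
    if (index + 7 > idt_limit) {                     /* gate outside the IDT -> #GP */
        f->vector = 13; f->error_code = index | 2; return false;
    }
    const struct idt_gate *g = &idt[vector];
    unsigned dpl  = (g->type_attr >> 5) & 3;
    unsigned type = g->type_attr & 0xF;
    bool present  = g->type_attr & 0x80;

    if (type != 0xE && type != 0xF) {                /* must be an interrupt or trap gate */
        f->vector = 13; f->error_code = index | 2; return false;
    }
    /* Privilege check done only for software interrupts (INT n / INT3 / INTO):
       the gate DPL must be >= CPL, otherwise #GP. Hardware interrupts skip it. */
    if (cpl > dpl) {
        f->vector = 13; f->error_code = index | 2; return false;
    }
    if (!present) {                                  /* not-present gate -> #NP */
        f->vector = 11; f->error_code = index | 2; return false;
    }
    /* Checks on the target code segment, a possible stack switch, pushing
       EFLAGS/CS/EIP, and clearing IF for interrupt gates are all omitted here. */
    *handler_cs     = g->selector;
    *handler_offset = (uint32_t)g->offset_high << 16 | g->offset_low;
    return true;
}
```

The 64-bit-mode and real-mode paths differ again (16-byte gates vs. a simple segment:offset vector table), which is part of why this work is left to a microcode routine selected by operating mode rather than handled by dedicated decoder logic.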
There are two possible ways in which the microcode sequencer is activated. The first one is when decoding a macroinstruction that requires more than 4 uops, similar to rdtsc. The second is when retiring an instruction where at least one of its uops has a valid event code in its ROB entry. According to this patent, there is a dedicated event code for software interrupts. So I think INTn is decoded into a single uop (or up to 4 uops) that carries the interrupt vector with it. The ROB already needs to have a field to hold information describing whether the corresponding instruction has raised an exception and what kind of exception; the same field could be used to hold the interrupt vector. The uop simply passes through the allocation stage and may not need to be scheduled into one of the execution units, because no computation needs to be done. When the uop is about to retire, the ROB determines that it is INTn and that it should raise an event (see Figure 10 of the patent). At this point, there are two possible ways to proceed:
- The ROB invokes a generic microcode assist that first checks the current operating mode of the processor and then selects a specialized assist that corresponds to the current mode.
- The ROB unit itself includes logic to check the current operating mode and selects the corresponding assist. It passes the assist address to the logic responsible for raising events, which in turn directs the MSROM to emit the assist routine stored at that address. This routine contains uops that fetch the IDT entry and perform the rest of the exception handler invocation process.
During the execution of the assist, an exception may occur. This will be handled like any other instruction that causes an exception: the ROB unit extracts the exception description from the ROB and invokes an assist to handle it.
Invalid opcodes can be handled in a similar fashion. At the predecode stage, the only thing that matters is correctly determining the lengths of the instructions that precede the invalid opcode; after these valid instructions, the boundaries are irrelevant. When a simple decoder receives an invalid opcode, it emits a special uop whose sole purpose is to raise an invalid-opcode exception. The other decoders handling bytes after the last valid instruction can all emit the same special uop. Since instructions are retired in order, it's guaranteed that the first special uop will raise an exception, unless of course a previous uop raised an exception or a branch-misprediction or memory-ordering clear event occurred first.
When any of the decoders emits that special uop, the fetch and decode stages could stall until the address of the macro-instruction exception handler is determined. This could be either for the exception specified by the uop or for some other exception. For every stage that processes the special uop, the stage can just stall (power down / clock gate) itself. This saves power, and I think it would be easy to implement.
Or, if the other logical core is active, treat it like any other reason for this logical thread to give up its front-end cycles to the other hyperthread. Allocation cycles normally alternate between hyperthreads, but when one is stalled (e.g. ROB full or front-end empty) the other thread can allocate in consecutive cycles. This might also happen in the decoders, and maybe it could be tested with a block of code large enough to stop it running from the uop cache (or one too dense to go into the uop cache).
Source: https://stackoverflow.com/questions/54427842/what-happens-to-software-interrupts-in-the-pipeline