Question
While reading the RISC-V User-Level ISA manual, I noticed that it says "OpenRISC has condition codes and branch delay slots, which complicate higher performance implementations." So RISC-V does not have a branch delay slot (RISC-V User-Level ISA manual link). Moreover, Wikipedia says that most newer RISC designs omit the branch delay slot. Why have most newer RISC architectures gradually dropped the branch delay slot?
Answer 1:
Citing Hennessy and Patterson (Computer Architecture: A Quantitative Approach, 5th ed.):
Fallacy: You can design a flawless architecture.
All architecture design involves trade-offs made in the context of a set of hardware and software technologies. Over time those technologies are likely to change, and decisions that may have been correct at the time they were made look like mistakes. (...) An example in the RISC camp is delayed branch. It was a simple matter to control pipeline hazards with five-stage pipelines, but a challenge for processors with longer pipelines that issue multiple instructions per clock cycle.
Indeed, in terms of software, the delayed branch only has drawbacks: it makes programs more difficult to read and less efficient, since the slot is frequently filled with nops.
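To make the contract concrete, here is a toy Python model of delayed-branch semantics (a sketch, not any real ISA: the instruction tuples and register names are made up for illustration). The point is simply that the instruction right after a branch always executes, so when the compiler has nothing independent to put there, a nop gets emitted and a cycle is wasted.

```python
# Toy model of a delay-slot ISA (illustrative only, not a real encoding).
# The instruction at pc+1 (the delay slot) executes whether or not the
# branch at pc is taken; control transfers only after it has run.

def execute(op, regs):
    if op[0] == "addi":              # ("addi", rd, rs, imm): rd = rs + imm
        regs[op[1]] = regs[op[2]] + op[3]
    elif op[0] == "nop":             # the frequently wasted slot
        pass

def run(prog, regs):
    pc = 0
    while pc < len(prog):
        op = prog[pc]
        if op[0] == "beq":           # ("beq", rs, rt, target)
            taken = regs[op[1]] == regs[op[2]]
            execute(prog[pc + 1], regs)          # delay slot always runs first
            pc = op[3] if taken else pc + 2      # then control transfers
        else:
            execute(op, regs)
            pc += 1

# A count-down loop whose two delay slots both end up as nops because no
# independent instruction was available to move into them.
regs = {"r0": 0, "r1": 3}
prog = [
    ("addi", "r1", "r1", -1),        # 0: r1 -= 1
    ("beq",  "r1", "r0", 5),         # 1: exit the loop when r1 == 0
    ("nop",),                        # 2: delay slot, wasted
    ("beq",  "r0", "r0", 0),         # 3: unconditional branch back to the top
    ("nop",),                        # 4: its delay slot, also wasted
    ("addi", "r2", "r0", 1),         # 5: after the loop: r2 = 1
]
run(prog, regs)
```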
In terms of hardware, it was a technological decision that made sense in the eighties, when pipelines had 5 or 6 stages and there was no other way to avoid the one-cycle branch penalty.
But nowadays pipelines are much more complex. The branch misprediction penalty is 15-25 cycles on recent Pentium microarchitectures. A one-instruction delayed branch is thus useless, and it would be nonsense (and clearly impossible) to try to hide that delay with a 15-instruction delayed branch, which would also break instruction-set compatibility.
And we have developed new techniques. Branch prediction is now a very mature technology. With present branch predictors, mispredictions are far rarer than branches with a useless (nop) delay slot, so prediction is accordingly more efficient, even on a 6-stage machine (like nios-f).
So delayed branches are less efficient in hardware and software. No reason to keep them.
Answer 2:
Delay slots are only helpful on a short in-order scalar pipeline, not on a high-performance superscalar one, and especially not on one with out-of-order execution.
They complicate exception handling significantly (for hardware and software), because you need to record the current program counter and, separately, a next-PC address in case the instruction in the delay slot takes an exception.
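Here is a sketch of the extra state the trap path has to carry on a delay-slot ISA (loosely modeled on the MIPS convention of pointing EPC at the branch and setting a branch-delay flag in Cause; the type and field names are invented for illustration):

```python
# Hypothetical trap-state capture for a delay-slot ISA (names invented;
# loosely modeled on MIPS EPC plus the Cause.BD bit).
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrapState:
    epc: int              # where to restart after the handler returns
    in_delay_slot: bool   # true if the faulting instruction sat in a delay slot

def capture_trap(fault_pc: int, prior_branch_pc: Optional[int]) -> TrapState:
    """Record enough state to resume correctly.

    If the faulting instruction is in a delay slot, restarting at fault_pc
    alone is wrong: the branch must be re-executed (or emulated by the OS)
    so that the next PC after the slot is recomputed. Hence epc points at
    the branch and a flag marks the delay-slot case.
    """
    if prior_branch_pc is not None and fault_pc == prior_branch_pc + 4:
        return TrapState(epc=prior_branch_pc, in_delay_slot=True)
    return TrapState(epc=fault_pc, in_delay_slot=False)
```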
They also complicate the question of how many instructions need to be killed on a mispredict in a 6-stage scalar or superscalar MIPS, by introducing multiple possibilities: the branch-delay instruction may already be in the pipeline and must not be killed, or it may still be waiting on an I-cache miss, in which case re-steering the front-end has to wait until the branch-delay instruction has been fetched.
Branch-delay slots architecturally expose an implementation detail of in-order classic RISC pipelines to the benefit of performance on that kind of uarch, but anything else has to work around it. It only avoids code-fetch bubbles from taken branches (even without branch prediction) if your uarch is a scalar classic RISC.
Even a modern in-order uarch needs branch prediction for good performance, with memory latency (measured in CPU clock cycles) being vastly higher than in the days of early MIPS.
(Fun fact: MIPS's 1 delay slot was sufficient to hide the total branch latency on the R2000 MIPS I, thanks to a clever design that kept it down to 1 cycle.)
Branch delay slots can't always be filled optimally by compilers, so even if we can implement them in a high-performance CPU without significant overhead, they do cost throughput in terms of total work done per instruction. Programs will usually need to execute more instructions, not fewer, with delay slots in the ISA.
(Although sometimes doing something unconditional after the compare-and-branch can allow reuse of the register instead of needing a new register, on an ISA without flags like MIPS where branch instructions test integer registers directly.)
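As a rough illustration of why the slot so often ends up as a nop, here is a toy filler in the spirit of what a compiler scheduler does (a simplification under an assumed instruction representation; real compilers can also fill the slot from the branch target or the fall-through path, with extra legality conditions):

```python
# Toy delay-slot filler: try to move one instruction from just before the
# branch into the slot, falling back to a nop. Instructions are dicts like
# {"op": "add", "defs": {"r3"}, "uses": {"r1", "r2"}} (made-up representation).

def fill_delay_slot(block, branch):
    """Return (new_block, slot_instruction) for the given basic block."""
    for i in range(len(block) - 1, -1, -1):
        insn = block[i]
        passed_over = block[i + 1:] + [branch]
        # Moving insn below the branch is legal only if nothing it passes over
        # reads or rewrites its results, and nothing it passes over rewrites
        # its inputs (true, output, and anti dependences respectively).
        conflicts = any(
            insn["defs"] & (other["uses"] | other["defs"]) or
            insn["uses"] & other["defs"]
            for other in passed_over
        )
        # Keep the sketch conservative: don't move memory ops into the slot.
        if not conflicts and insn["op"] not in ("load", "store"):
            return block[:i] + block[i + 1:], insn
    return block, {"op": "nop", "defs": set(), "uses": set()}
```

Even this simplified legality check fails often in small basic blocks, which is exactly when the slot degenerates into a nop and the extra-instruction cost described above shows up.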
Answer 3:
Branch delay slots were introduced as a performance workaround in the earliest single-issue, in-order RISC implementations. As early as the second commercial implementations of these architectures, it was already clear that both the delay slot and the notion of a single condition code were going to be in the way. By the time we did the 64-bit SPARC architecture at HaL, register windows had been added to that list. The combined challenges were enough that we proposed supporting SPARC32 through dynamic binary translation so that we could abandon the legacy burden. At that point, carrying these features cost 40% of the chip area and 20% to 25% of the instruction issue rate.
Modern processor implementations are aggressively out-of-order (read up on "register renaming" or "Tomasulo's algorithm"), dynamically scheduled, and in many cases multi-issue. In consequence, the delayed branch has gone from being a performance enhancement to a complication that the instruction sequencing unit and the register rename logic have to carefully step around for the sake of compatibility.
Frankly, it wasn't a great idea on the SOAR/SPARC or the MIPS chip either. Delayed branches create interesting challenges for single-stepping in debuggers, for dynamic binary translators, and for binary code analysis (I've implemented all of these at one time or another). Even on the single-issue machines, they created some interesting complications for exception handling.
Alain's comment about branch cost on the Pentium doesn't carry over straightforwardly to RISC parts, and the issue is a bit more complicated than he suggests. On fixed-length instruction sets, it is straightforward to implement something called a "branch target buffer", which caches the instructions at branch targets so that there is no pipeline stall arising from the branch. On the original RISC machine (the IBM 801), John Cocke incorporated a "prepare to branch" instruction whose purpose was to allow the program (or more precisely, the compiler) to explicitly load likely targets into the branch target buffer. In a good implementation, instructions in the BTB are pre-decoded, which shaves a cycle off the pipeline and makes a correctly predicted transition through the BTB very nearly free. The problem at that point is the condition codes and misprediction.
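Conceptually a BTB is just a small cache indexed by the fetch address; the following minimal sketch (direct-mapped, full tags, and without the pre-decoded target instructions mentioned above) is only meant to show the lookup-and-train shape, not any particular machine's design:

```python
# Minimal branch-target-buffer sketch (illustrative; real BTBs are
# set-associative, use partial tags, and may store pre-decoded instructions).

class BranchTargetBuffer:
    def __init__(self, entries=512):
        self.entries = entries
        self.table = {}                       # index -> (tag, predicted_target)

    def _index_tag(self, pc):
        word = pc >> 2                        # fixed-length 4-byte instructions
        return word % self.entries, word

    def predict(self, pc):
        """Predicted next-fetch address, or None meaning 'no branch known here'."""
        idx, tag = self._index_tag(pc)
        entry = self.table.get(idx)
        if entry and entry[0] == tag:
            return entry[1]                   # redirect fetch with no bubble
        return None                           # default: fall through to pc + 4

    def update(self, pc, taken, target):
        """Train on branch resolution (or on a 'prepare to branch' style hint)."""
        idx, tag = self._index_tag(pc)
        if taken:
            self.table[idx] = (tag, target)
        elif self.table.get(idx, (None, None))[0] == tag:
            del self.table[idx]               # stop predicting a not-taken branch
```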
Because of the BTB and multi-issue, the notion of a branch delay and a branch mispredict delay need to be re-imagined. What actually happens on many multi-issue machines is that the processor proceeds down both paths of the branch - at least while it can get the instructions from the currently preloaded cache line in the instruction fetch unit or the instructions in the BTB. This has the effect of slowing instruction issue on both sides of the branch but also lets you make progress on both sides of the branch. When the branch resolves, the "should not have taken" path is abandoned. For integer processing this slows you down. For floating point it's less clear because the computational operations take several cycles.
Internally, an aggressively multi-issue machine is likely to have three or four operations queued up at the time of the branch, so the branch delay can often be compensated for by executing these already-queued instructions and then rebuilding the queue depth.
Source: https://stackoverflow.com/questions/54724410/why-is-the-branch-delay-slot-deprecated-or-obsolete