Branch target prediction in conjunction with branch prediction?

Submitted by 落爺英雄遲暮 on 2019-11-30 01:42:07

Do read along with the Intel optimization manual; the current download location is here. When the link goes stale (they move stuff around all the time), search the Intel site for "Architectures optimization manual". Keep in mind the info there is fairly generic; they disclose only as much as needed to allow writing efficient code. Branch prediction implementation details are considered a trade secret and do change between architectures. Search the manual for "branch prediction" to find references; the material is fairly spread across the chapters.

I'll give a summary of what's found in the manual, adding details where appropriate:

Branch prediction is the job of the BPU (Branch Prediction Unit) in the core. It roughly correlates to "BP" in your question. The BPU contains several sub-units:

  • The branch history table. This table keeps track of previously taken conditional branches and is consulted by the predictor to decide if a branch is likely to be taken. It is fed with entries by the instruction retirement unit, the one that knows whether the branch was actually taken. This is the sub-unit that has changed the most as the architectures improved, getting deeper and smarter as more real estate became available.

  • The BTB, Branch Target Buffer. This buffer stores the target address of a previously taken indirect jump or call. It correlates to "BTP" in your question. The manual does not state whether the buffer can store multiple targets per branch address, indexed by the history table; I consider it likely on later architectures.

  • The Return Stack Buffer. This buffer acts as a "shadow" stack, storing the return address for CALL instructions and making the target of a RET instruction available with a high degree of confidence, without the processor having to rely on the BTB, which is unlikely to be as effective for returns. It is documented to be 16 levels deep.
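As a rough illustration of how those three sub-units divide the work, here is a toy model in Python. All names, sizes (except the documented 16-entry return stack), and the 2-bit counter scheme are illustrative assumptions, not Intel's actual, undisclosed design:

```python
from collections import defaultdict

class ToyBPU:
    """Toy model of the three BPU sub-units described above.
    Indexing and counter widths are invented for illustration."""

    def __init__(self, rsb_depth=16):
        # Branch history table: a 2-bit saturating counter per branch,
        # 0..1 = predict not-taken, 2..3 = predict taken.
        self.bht = defaultdict(lambda: 2)
        # Branch target buffer: last observed target per branch address.
        self.btb = {}
        # Return stack buffer: documented as 16 levels deep.
        self.rsb = []
        self.rsb_depth = rsb_depth

    def predict_taken(self, branch_pc):
        return self.bht[branch_pc] >= 2

    def retire_conditional(self, branch_pc, taken):
        # The retirement unit, which knows the real outcome, feeds the table.
        c = self.bht[branch_pc]
        self.bht[branch_pc] = min(3, c + 1) if taken else max(0, c - 1)

    def predict_target(self, branch_pc):
        return self.btb.get(branch_pc)  # None = no prediction available yet

    def retire_indirect(self, branch_pc, target):
        self.btb[branch_pc] = target

    def on_call(self, return_addr):
        if len(self.rsb) == self.rsb_depth:
            self.rsb.pop(0)  # overflow: the oldest return address is lost
        self.rsb.append(return_addr)

    def predict_return(self):
        # LIFO pop matches CALL/RET nesting, which is why the RSB beats
        # the BTB for RET: the same RET returns to many different callers.
        return self.rsb.pop() if self.rsb else None
```

Note how the RSB's LIFO discipline naturally pairs each RET with the most recent CALL, something a per-address BTB entry cannot express.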

Bullet 2) is a bit difficult to answer accurately; the manual only talks about the "Front End" and does not break down the details of the pipeline. Appropriately enough, since it is heavily architecture dependent. The diagram in section 2.2.5 is possibly illustrative. The execution trace cache plays a role: it stores previously decoded instructions, so it is the primary source of BPU consultations. Otherwise the BPU sits right after the instruction translator (aka decoder).

BP and BTP are naturally closely related, but they're obviously not the same thing. I think your biggest confusion comes from the claim that since BTP predicts the target of a given branch, it can tell you the outcome (i.e. - what will be the next instruction executed). That's not the case.

A branch target is the address the branch may send you to, if it's taken. Whether or not the branch is taken is a completely different question and is addressed by the branch predictor. In fact, the two units usually work together in the early stages of the pipeline, producing (if needed) both the taken/not-taken prediction and the address prediction. Then comes the logic that says, basically: if it's a branch, and it's predicted taken (or is unconditional), then jump to the target if you have it (whether known or predicted).
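That combining logic can be sketched as a small function. The names and the three-way `kind` classification are my own shorthand, not any real front end's interface:

```python
def next_fetch_address(pc, insn_len, kind, predicted_taken, predicted_target):
    """Toy next-PC selection combining the two predictions.
    kind is one of 'not_branch', 'cond', 'uncond' (invented labels)."""
    if kind == 'not_branch':
        return pc + insn_len
    if kind == 'cond' and not predicted_taken:
        return pc + insn_len        # predicted fall-through: no target needed
    if predicted_target is not None:
        return predicted_target     # taken (or unconditional): use the target
    return pc + insn_len            # no target prediction yet: keep fetching
```

The key point the sketch makes: the direction prediction alone is useless for a taken branch unless a target (known or predicted) is also available.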

As you quoted yourself in the branch types list, the question of whether a branch needs its direction predicted (is it conditional?) and the question of whether it needs its target predicted (is it indirect, as opposed to direct / "fixed target" as you call it) are both applicable, and each can go either way independently of the other, giving you the 4 combinations you listed:

  • Unconditional direct branches, in theory, require no prediction: the CPU front end could simply read the target and "take" the branch (feeding the pipeline code from the new address). However, modern CPUs still need time to decode the branch and extract the target encoded in it, so to avoid stalls at the branch predictor (which normally sits at the head of the pipe), they have to predict that address as well. Confirming the prediction is cheap, though (it happens immediately after decode), so the misprediction penalty isn't very high. Fetch could still stall on code cache / TLB misses, but this is otherwise the fastest (though one might say the weakest) kind of branch.

  • Conditional direct branches know their target after decode (but again, must predict it ahead of that), but can't tell whether the branch is taken until the condition is executed and resolved, which may be very far down the pipe. The condition may in turn depend on earlier instructions, and could stall until its sources are known. So two predictions are made, target and direction (unless the predicted direction is fall-through, in which case no target is needed), and the direction resolution is the riskier one. The branch predictor (actually, on modern CPUs there are usually several of them) makes an educated guess and fetching continues from there. Some studies, mostly academic, have even tried fetching and executing both paths (though you can immediately see this may explode exponentially, since there's usually a branch every few instructions, so it tends to be reserved for hard-to-predict ones). Another popular option is "predicating" (mind the 'a' there) the two paths, i.e. sending some bits down the pipeline to mark which path each instruction belongs to, for easy flushing of the wrong path once the resolution is known. This is quite popular on dataflow machines due to the language structure, but that's an entirely different question.

  • Unconditional indirect branches: these are nasty, since they're both common (every ret, for example) and harder to predict. While branch resolution was simple in the previous cases (and could always fall back on heuristics or pattern guessing), this one needs an actual address, so you probably have to visit this specific branch with this specific target a few times before the BTP learns the pattern there.

  • Conditional indirect branches: well, bad luck for you, you need both predictions...
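To make the indirect-branch case concrete, here is a toy experiment contrasting a plain last-target BTB with a history-indexed one (the scheme speculated about earlier for later architectures). This is an academic-style illustration of the technique, not any real CPU's design; the addresses and update policy are made up:

```python
def run(targets_seq, history_bits=0):
    """Count correct indirect-target predictions for one branch.
    history_bits=0: plain last-target BTB, keyed by branch PC only.
    history_bits>0: BTB keyed by (branch PC, last N observed targets)."""
    btb, history, hits = {}, (), 0
    pc = 0x400  # a single indirect branch at a fixed (invented) address
    for actual in targets_seq:
        key = (pc, history)
        if btb.get(key) == actual:
            hits += 1
        btb[key] = actual  # update with the resolved target at retirement
        if history_bits:
            history = (history + (actual,))[-history_bits:]
    return hits

# An indirect branch alternating between two targets, 100 executions:
alternating = [0x10, 0x20] * 50
```

A last-target BTB scores zero on the alternating pattern (it always predicts the previous target, which is always wrong), while a single target of history lets the predictor lock onto the pattern after a short warm-up. This is the sense in which "visiting the branch a few times" teaches the BTP.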

So, the decisions are orthogonal, but that doesn't mean the predictors have to be. Keep in mind that you have a single "stream" of branch history, so it probably pays to have the predictors related somehow, sharing some tables or some logic. How exactly is a design decision that depends on the actual HW implementation; you're probably not going to get many details on how Intel/AMD do it, but there is plenty of academic research on the topic.

As for the second question: it's a bit broad, and again you won't be able to get all the exact details on real CPUs, but you can find hints here and there; see, e.g., the diagram from this Haswell review (which may have appeared here before somewhere):

This diagram doesn't tell you everything; it's obviously missing the inputs to the BP/BTP, and even the distinction between them (which in itself already tells you they're probably built together), but it does show that this is apparently the first and foremost part of the pipeline. You need to predict the next instruction pointer before you can go ahead and feed it into the fetch/decode/... pipeline (or the alternative uop-cache path). This probably means the CPU starts every cycle (well, yes, everything is really done in parallel, but it helps to think of a pipeline as a staged process) by deciding which instruction to perform next. Say it knows where we were last time: then the next instruction is either a non-branch (ahh, but what about varying instruction lengths... another complication this unit needs to solve), or a branch, in which case this unit should guess which of the above types the branch belongs to, and predict the next instruction accordingly.

Note that I wrote "guess": if the diagram tells the truth, the decode stage is really far away; you don't even know it's a branch at this point. So, to answer your question, this BP/BTP unit needs to communicate with the execution/WB units so it can learn the outcome of conditional branches, with the decode unit so it can learn which currently decoded instruction is a branch and of what type, and with the various fetch pipelines to feed them its output. I'm guessing there are further connections to other units (e.g. some designs may issue code prefetches based on target predictions, etc.).
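The "predict first, decode later" loop can be caricatured in a few lines. This is purely schematic, fetching one address per cycle from the predictor alone; the predictor callable stands in for the whole BP/BTP machinery:

```python
def front_end(start_pc, predict_next, cycles):
    """Schematic front-end loop: every cycle, the fetch address comes
    from prediction alone; decode, stages later, can only confirm or
    flush. predict_next is any callable mapping PC -> predicted next PC."""
    pc, trace = start_pc, []
    for _ in range(cycles):
        trace.append(pc)        # address sent to fetch this cycle
        pc = predict_next(pc)   # guessed before decode has seen the bytes
    return trace
```

For example, a trivial "everything is a 4-byte non-branch" predictor yields `front_end(0x1000, lambda pc: pc + 4, 3)` → `[0x1000, 0x1004, 0x1008]`; plugging in a smarter predictor changes only the callable, which is roughly the separation of concerns the diagram implies.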
