From here I know Intel has implemented several static branch prediction mechanisms over the years:
80486 age: Always-not-taken
Pentium4 age: Backwards taken, forwards not-taken
My understanding is that with current designs, modern TAGE branch direction predictors always index to an entry, using the taken/not-taken history of recent branches. (This potentially spreads the state for a single branch out over a lot of internal state, making it possible to predict very complex patterns like a 10 element BubbleSort.)
The CPU doesn't try to detect aliasing and just uses the prediction it finds to decide taken/not-taken for conditional branches. i.e. branch-direction prediction is always dynamic, never static.
But a target prediction is still needed before the branch is even decoded to keep the front-end from stalling. The Branch Target Buffer is normally tagged, because the target of some other branch that aliased is unlikely to be useful.
As @Paul A Clayton points out, a BTB miss could let the CPU decide to use static prediction instead of whatever it found in the dynamic taken / not-taken predictor. We might just be seeing that it's much harder to make the dynamic predictor miss often enough to measure static prediction.
(I might be distorting things: modern TAGE predictors can predict complex patterns for indirect branches too, so I'm not sure whether they even try to predict in terms of taken/not-taken, or whether the first step is always just to predict the next address, whether or not that's the next instruction. See Indexed branch overhead on X86 64 bit mode.)
Not-taken branches are still slightly cheaper in the correctly-predicted case, because the front-end can more easily fetch earlier and later instructions in the same cycle from the uop cache. (The uop cache in Sandybridge-family is not a trace cache; a uop-cache line can only cache uops from a contiguous block of x86 machine code.) In high-throughput code, taken branches could be a minor front-end bottleneck. They also typically spread the code out over more L1i and uop-cache lines.
For indirect branches, the "default" branch-target address is still next-instruction, so it can be useful to put a `ud2` or something after a `jmp rax` to prevent mis-speculation (especially into non-code), if you can't simply put one of the real branch targets as the next instruction. (Especially the most common one.)
Branch prediction is kind of the "secret sauce" that CPU vendors don't publish details about.
Intel actually publishes instruction throughput / latency / execution-port info themselves (through IACA and some documents), but it's fairly straightforward to test experimentally (like https://agner.org/optimize/ and http://instlatx64.atw.hu/ have done) so it's not like Intel could keep that secret even if they wanted to.
Branch-prediction success rate is easy to measure with perf counters, but knowing why one specific branch was mispredicted or not on one specific execution is very hard; even measuring is hard for a single execution of one branch, unless you instrument your code with `rdtsc` or `rdpmc` or something.