What techniques to avoid conditional branching do you know?

Asked by 迷失自我 on 2021-02-03 23:47

Sometimes a loop where the CPU spends most of the time has a branch that mispredicts very often (near 0.5 probability). I've seen a few techniques on very isol…

9 Answers
  • 2021-02-04 00:10

    Using Matt Joiner's example:

    if (b > a) b = a;
    

    You could also do the following, without having to dig into assembly code:

    bool if_else = b > a;
    b = a * if_else + b * !if_else;
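    A minimal, self-contained sketch of this arithmetic select (the function name is illustrative; it returns the new value of `b`, which ends up being the minimum of the two):

    ```cpp
    // Branchless select: instead of "if (b > a) b = a;", multiply each
    // candidate by the 0/1 result of the comparison and add. Exactly one
    // of the two terms is nonzero.
    int arith_select_min(int a, int b) {
        bool if_else = b > a;              // 1 when the "then" branch would fire
        return a * if_else + b * !if_else; // picks a when b > a, else keeps b
    }
    ```

    Whether this beats a branch depends on the data: if the branch is well predicted, the plain `if` is usually cheaper, since the arithmetic form always pays for both multiplies.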
    
  • 2021-02-04 00:11

    This level of optimization is unlikely to make a worthwhile difference in all but the hottest of hotspots. Assuming it does (without proving it in a specific case) is a form of guessing, and the first rule of optimization is don't act on guesses.

  • 2021-02-04 00:13

    An extension of the technique demonstrated in the original question applies when you have to do several nested tests to get an answer. You can build a small bitmask from the results of all the tests, and then look up the answer in a table.

    if (a) {
      if (b) {
        result = q;
      } else {
        result = r;
      }
    } else {
      if (b) {
        result = s;
      } else {
        result = t;
      }
    }
    

    If a and b are nearly random (e.g., from arbitrary data), and this is in a tight loop, then branch prediction failures can really slow this down. This can be written as:

    // assuming a and b are bools and thus exactly 0 or 1 ...
    static const int table[] = { t, s, r, q };
    unsigned index = (a << 1) | b;
    result = table[index];
    

    You can generalize this to several conditionals. I've seen it done for 4. If the nesting gets that deep, though, you want to make sure that testing all of them is really faster than doing just the minimal tests suggested by short-circuit evaluation.
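    A runnable sketch of the table lookup, with placeholder values standing in for q, r, s, and t:

    ```cpp
    // Replace the nested if/else over two booleans with one table lookup.
    // Index layout: (a << 1) | b, so table[0b00] = t, table[0b01] = s,
    // table[0b10] = r, table[0b11] = q — matching the nested tests above.
    int select_branchless(bool a, bool b) {
        static const int q = 1, r = 2, s = 3, t = 4;  // placeholder results
        static const int table[] = { t, s, r, q };
        unsigned index = (static_cast<unsigned>(a) << 1)
                       | static_cast<unsigned>(b);
        return table[index];
    }
    ```

    The table load replaces two data-dependent branches with one predictable indexed load, which is the whole point when a and b are near-random.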

  • 2021-02-04 00:16

    Most processors provide branch prediction that is better than 50%. In fact, if you get a 1% improvement in branch prediction you can probably publish a paper. There is a mountain of papers on this topic if you are interested.

    You're better off worrying about cache hits and misses.

  • 2021-02-04 00:17

    At this level things are very hardware-dependent and compiler-dependent. Is the compiler you're using smart enough to compile < without control flow? gcc on x86 is smart enough; lcc is not. On older or embedded instruction sets it may not be possible to compute < without control flow.

    Beyond this Cassandra-like warning, it's hard to make any helpful general statements. So here are some general statements that may be unhelpful:

    • Modern branch-prediction hardware is terrifyingly good. If you could find a real program where bad branch prediction costs more than 1%-2% slowdown, I'd be very surprised.

    • Performance counters or other tools that tell you where to find branch mispredictions are indispensable.

    • If you actually need to improve such code, I'd look into trace scheduling and loop unrolling:

      • Loop unrolling replicates loop bodies and gives your optimizer more control flow to work with.

      • Trace scheduling identifies which paths are most likely to be taken, and among other tricks, it can tweak the branch directions so that the branch-prediction hardware works better on the most common paths. With unrolled loops, there are more and longer paths, so the trace scheduler has more to work with.

    • I'd be leery of trying to code this myself in assembly. When the next chip comes out with new branch-prediction hardware, chances are excellent that all your hard work goes down the drain. Instead I'd look for a feedback-directed optimizing compiler.

  • 2021-02-04 00:18

    GCC is already smart enough to replace conditionals with simpler instructions. For example, newer Intel processors provide cmov (conditional move). If you can use it, SSE2 provides some instructions to compare 4 integers (or 8 shorts, or 16 chars) at a time.
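    As a sketch of the SSE2 point (x86 with SSE2 assumed; the function name is illustrative), `_mm_min_epi16` computes eight 16-bit minimums in one instruction, with no per-element branches:

    ```cpp
    #include <emmintrin.h>  // SSE2 intrinsics

    // Elementwise minimum of 8 signed 16-bit ints at once, branch-free.
    // a, b, and out each point to 8 shorts; unaligned loads/stores are used
    // so the caller need not worry about alignment.
    void min8_sse2(const short* a, const short* b, short* out) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out),
                         _mm_min_epi16(va, vb));
    }
    ```

    On non-x86 targets the same idea exists under different names (e.g. NEON's `vminq_s16`), but the snippet above is SSE2-specific.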

    Additionally, to compute the minimum you can use (see these magic tricks):

    min(x, y) = x+(((y-x)>>(WORDBITS-1))&(y-x))
    

    However, pay attention to things like:

    c[i][j] = min(c[i][j], c[i][k] + c[j][k]);   // from the Floyd-Warshall algorithm
    

    even though no jumps are implied, is much slower than

    int tmp = c[i][k] + c[j][k];
    if (tmp < c[i][j])
        c[i][j] = tmp;
    

    My best guess is that in the first snippet you pollute the cache more often, while in the second you don't.
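    The magic-trick min above can be sanity-checked with a small sketch. Note two assumptions it relies on: right-shifting a negative int must be an arithmetic shift (true on mainstream compilers, and guaranteed since C++20), and `y - x` must not overflow:

    ```cpp
    #include <climits>

    // Branchless min via the sign of (y - x): when y < x, the arithmetic
    // right shift smears the sign bit into an all-ones mask, so the
    // (y - x) term is added; otherwise the mask is zero and x is returned.
    int bithack_min(int x, int y) {
        const int WORDBITS = sizeof(int) * CHAR_BIT;  // 32 for a 32-bit int
        return x + (((y - x) >> (WORDBITS - 1)) & (y - x));
    }
    ```

    As the answer warns, branch-free is not automatically faster: the plain `if` version only stores to `c[i][j]` when the value actually changes, while a `min`-based assignment writes every iteration.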
