About the branchless binary search

醉酒成梦 2021-01-21 06:50

I asked a question about reducing branch mispredictions.

Jerry Coffin gave me an impressive answer.

About reducing the branch misprediction

The binary se

4 Answers
  •  执笔经年
    2021-01-21 07:08

    The problem with the conditional move (branchless) search occurs when the arrays are large and the memory access time is large relative to the cost of a branch misprediction.

    A conditional move search is something like:

    int needle;      // value we are searching for
    int *base = ...; // base pointer
    int n;           // number of elements in the current region
    while (n > 1) {
      int middle = n/2;
      base += (needle < base[middle]) ? 0 : middle;
      n -= middle;
    }
    

    Note that we conditionally update base without using a branch (at least assuming the compiler doesn't decide to implement the ternary operator as a branch). The problem is that the value of base in each iteration is data-dependent on the result of the comparison in the previous iteration, and so accesses to memory occur one at a time, serialized via the data dependency.

    For a search over large array, this removes the possibility of memory-level parallelism and your search takes something like log2(N) * average_access_time. A branch-based search has no such data dependency: it has only a speculated control dependency between iterations: the CPU picks a direction and goes with it. If it guesses right, you'll be loading the result from the current iteration and the next at the same time! It doesn't end there: the speculation continues and you might have have a dozen loads in flight at once.

    Of course, the CPU doesn't always guess right! In the worst case, if the branches are totally unpredictable (your data and needle value don't have any kind of bias), it will be wrong half the time. Still, that means that on average it will sustain 0.5 + 0.25 + 0.125 + ... = ~1 additional access in flight beyond the current one. This isn't just theoretical: try a binary search over random data and you'll probably see a 2x speedup for the branch-based over the branchless search, due to double the parallelism.

    For many data sets the branch direction isn't entirely random, so you can see more than 2x speedup, as in your case.

    The situation is reversed for small arrays that fit in cache. The branchless search still has this same "serial dependency" problem, but the load latency is small: a handful of cycles. The branch-based search, on the other hand, suffers constant mispredictions, which cost on the order of ~20 cycles, so branchless usually ends up faster in this case.
