I asked a question about reducing branch misprediction.
Jerry Coffin gave me an impressive answer.
About reducing the branch misprediction
The binary se
I saw an interesting approach a while back, probably also on stackoverflow, about avoiding the data fetch cost. Someone wrote a binary search in such a way that they treated the array as an implicit tree and prefetched both the left child and the right child. This was done before the current element had even been compared to the test value.
It seemed strongly counterintuitive that increasing the memory demand twofold could actually speed up a search, but apparently starting the fetches earlier made up for the extra memory hit.
If I remember correctly, half the reads were effectively non-dependent, since the values weren't used. It can be done by speculative prefetch loads, non-dependent loads, or ordinary loads where one of the values fetched is moved into the register holding the current element when looping.
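A minimal sketch of that idea, not the code from the post I'm remembering. It assumes GCC or Clang, since `__builtin_prefetch` is a compiler builtin rather than standard C, and the function name `search_prefetch` is made up for illustration. Before the current element is compared, it issues prefetches for both elements that could be probed on the next iteration:

```c
#include <stddef.h>

/* Hypothetical helper: binary search over a sorted array that prefetches
 * both possible next probes before comparing the current one.
 * Requires n >= 1. Returns a pointer to the last element <= needle,
 * or to the first element if every element is greater than needle. */
const int *search_prefetch(const int *base, size_t n, int needle)
{
    while (n > 1) {
        size_t middle = n / 2;
        /* Midpoint offset that the next iteration will use. */
        size_t next_middle = (n - middle) / 2;

        /* Next probe if base stays put, and next probe if base advances.
         * Prefetches are hints only; they do not change the result. */
        __builtin_prefetch(&base[next_middle]);
        __builtin_prefetch(&base[middle + next_middle]);

        base += (needle < base[middle]) ? 0 : middle;
        n -= middle;
    }
    return base;
}
```

Whether this helps depends heavily on array size and hardware, so it is worth measuring rather than assuming. Both prefetched addresses stay inside the current region, because `middle + next_middle` is always less than `n`.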
Because that version is doing a ton of loads and stores.
Branch prediction in a tight loop like that often has no effect because the processor has multiple pipelines. As the branch test is being evaluated, both code paths are already being decoded and evaluated. Only the results of one path are kept - but there is usually no pipeline stall from a branch.
Writing to memory, on the other hand, can have an effect. Usually you are writing to a memory cache on the CPU, but the MMU then has to keep the cache lines sync'd to the rest of the system. If the array is large and you are accessing it in essentially random order, you are getting constant cache misses and forcing the CPU to reload its cache.
The problem with the conditional move (branchless) search occurs when the arrays are large and the memory access time is large relative to a branch misprediction.
A conditional move search is something like:
int needle; // value we are searching for
int *base = ...; // base pointer
int n; // number of elements in the current region
while (n > 1) {
    int middle = n/2;
    // base[middle] is already an int, so no extra dereference is needed
    base += (needle < base[middle]) ? 0 : middle;
    n -= middle;
}
Note that we conditionally update base without using a branch (at least assuming the compiler doesn't decide to implement the ternary operator as a branch). The problem is that the value of base in each iteration is data-dependent on the result of the comparison in the previous iteration, and so accesses to memory occur one at a time, serialized via the data dependency.
For a search over a large array, this removes the possibility of memory-level parallelism and your search takes something like log2(N) * average_access_time. A branch-based search has no such data dependency: it has only a speculated control dependency between iterations: the CPU picks a direction and goes with it. If it guesses right, you'll be loading the result from the current iteration and the next at the same time! It doesn't end there: the speculation continues and you might have a dozen loads in flight at once.
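For comparison, a branch-based counterpart to the conditional move loop might look like the sketch below. The function name `search_branchy` is hypothetical, and nothing forces the compiler to keep the `if` as a real branch: an optimizer may turn it into a conditional move, so check the generated assembly before drawing conclusions from a benchmark.

```c
#include <stddef.h>

/* Hypothetical helper: same search as the conditional move loop, but
 * written with an explicit branch so the CPU can speculate past it.
 * Requires n >= 1. Returns a pointer to the last element <= needle,
 * or to the first element if every element is greater than needle. */
const int *search_branchy(const int *base, size_t n, int needle)
{
    while (n > 1) {
        size_t middle = n / 2;
        /* Control dependency only: the next load address can be
         * predicted and issued before this comparison resolves. */
        if (needle >= base[middle]) {
            base += middle;
        }
        n -= middle;
    }
    return base;
}
```

The two versions compute the same result; the only intended difference is whether the update of base goes through a predicted branch or through a data dependency.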
Of course, the CPU doesn't always guess right! In the worst case, if the branches are totally unpredictable (your data and needle value don't have any kind of bias), it will be wrong half the time. Still, that means that on average it will sustain 0.5 + 0.25 + 0.125 + ... = ~1 additional accesses in flight beyond the current one. This isn't just theoretical: try a binary search over random data and you'll probably see a roughly 2x speedup for branch-based over the branchless search, due to double the parallelism.
For many data sets the branch direction isn't entirely random, so you can see more than 2x speedup, as in your case.
The situation is reversed for small arrays that fit in cache. The branchless search still has this same "serial dependency" problem, but the load latency is small: a handful of cycles. The branch-based search, on the other hand, suffers constant mispredictions, which cost on the order of ~20 cycles, so branchless usually ends up faster in this case.
Use your original binary search then. Array accesses to random locations aren't much better than branch misses, especially since the compiler can't use registers for the variables in that case.