I asked a question about reducing branch misprediction.
Jerry Coffin gave me an impressive answer.
About reducing the branch misprediction
The binary search part of the answer:
I saw an interesting approach a while back, probably also on Stack Overflow, for hiding the cost of fetching data. Someone wrote a binary search that treated the array as an implicit tree and prefetched both the left child and the right child of the current element before that element had even been compared with the search key.
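A minimal sketch of that idea in C, assuming the "implicit tree" is the Eytzinger (breadth-first) layout, where node k has children at 2k and 2k+1. The helper names (`eyt_fill`, `eyt_build`, `eyt_lower_bound`) are hypothetical, and `__builtin_prefetch` / `__builtin_ffsll` are GCC/Clang builtins, so this is not portable standard C. The array is padded so that the child addresses stay inside the allocation even at the last level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Recursively place sorted a[0..n-1] into t[1..n] in Eytzinger order:
   an in-order walk of the implicit tree visits keys in sorted order. */
static size_t eyt_fill(const int *a, int *t, size_t n, size_t i, size_t k) {
    if (k <= n) {
        i = eyt_fill(a, t, n, i, 2 * k);      /* left subtree: smaller keys */
        t[k] = a[i++];                        /* this node */
        i = eyt_fill(a, t, n, i, 2 * k + 1);  /* right subtree: larger keys */
    }
    return i;
}

/* Allocate 2n+2 slots: slot 0 is unused and slots n+1..2n+1 are padding,
   so &t[2k] and &t[2k+1] are in bounds for every k <= n. */
static int *eyt_build(const int *a, size_t n) {
    int *t = calloc(2 * n + 2, sizeof *t);
    if (t != NULL) {
        size_t used = eyt_fill(a, t, n, 0, 1);
        assert(used == n);
        (void)used;  /* silence the unused warning when NDEBUG is set */
    }
    return t;
}

/* Return the tree index of the smallest key >= x, or 0 if all keys < x. */
static size_t eyt_lower_bound(const int *t, size_t n, int x) {
    size_t k = 1;
    while (k <= n) {
        /* Request both children before the comparison below resolves. */
        __builtin_prefetch(&t[2 * k]);
        __builtin_prefetch(&t[2 * k + 1]);
        k = 2 * k + (t[k] < x);  /* go right if t[k] < x, else left */
    }
    /* Drop the trailing right turns plus the last left turn; that left
       turn was taken at the answer node. All right turns give 0. */
    k >>= __builtin_ffsll((long long)~k);
    return k;
}
```

Since t[2k] and t[2k+1] are adjacent, the two prefetches often land in the same cache line; prefetching several levels further down the tree is a common refinement of the same idea.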
It seemed strongly counterintuitive that doubling the memory traffic could actually speed up a search, but apparently starting the fetches earlier more than made up for the extra memory accesses.
If I remember correctly, half of those reads were effectively non-dependent: only one child's value is ever used, so nothing depended on the other one. It can be done with speculative prefetch instructions, with loads whose results nothing depends on, or with ordinary loads of both children, where the loop then moves the chosen child's value into the register holding the current element.
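A sketch of the last variant (ordinary loads of both children, then a select), again assuming an Eytzinger layout and using hypothetical helper names. Both loads depend only on k, not on the comparison at the current node, so they can be issued while that comparison is still pending. Whether the ternary compiles to a conditional move rather than a branch is up to the compiler, so it is worth checking the generated assembly. `__builtin_ffsll` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Place sorted a[0..n-1] into t[1..n] in Eytzinger order (children 2k, 2k+1). */
static size_t eyt_fill(const int *a, int *t, size_t n, size_t i, size_t k) {
    if (k <= n) {
        i = eyt_fill(a, t, n, i, 2 * k);
        t[k] = a[i++];
        i = eyt_fill(a, t, n, i, 2 * k + 1);
    }
    return i;
}

/* Pad to 2n+2 slots so t[2k] and t[2k+1] can be read for any k <= n;
   the padded values are loaded but never used as a result. */
static int *eyt_build(const int *a, size_t n) {
    int *t = calloc(2 * n + 2, sizeof *t);
    if (t != NULL) {
        size_t used = eyt_fill(a, t, n, 0, 1);
        assert(used == n);
        (void)used;
    }
    return t;
}

/* Lower bound that keeps the current node's key in a local (ideally a
   register) and refills it from one of two already-loaded children. */
static size_t eyt_lower_bound_select(const int *t, size_t n, int x) {
    size_t k = 1;
    int cur = t[1];
    while (k <= n) {
        int left = t[2 * k];       /* plain loads: addresses depend only */
        int right = t[2 * k + 1];  /* on k, not on the compare below     */
        int go_right = cur < x;
        k = 2 * k + (size_t)go_right;
        cur = go_right ? right : left;  /* move the chosen child into cur */
    }
    k >>= __builtin_ffsll((long long)~k);  /* recover the answer node */
    return k;
}
```

Compared with the prefetch version, this trades the hint instructions for real loads, so one of the two values per level is simply discarded.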