Sometimes a loop where the CPU spends most of its time contains a branch that is mispredicted very often (near 0.5 probability). I've seen a few techniques on very isolated…
At this level things are very hardware-dependent and compiler-dependent. Is the compiler you're using smart enough to compile < without control flow? gcc on x86 is smart enough; lcc is not. On older or embedded instruction sets it may not be possible to compute < without control flow.
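For a concrete sense of what "compiling < without control flow" means, here's a minimal C sketch; whether you actually get a branch-free sequence (setcc/cmov on x86) depends on the compiler, target, and optimization level, so inspect the assembly rather than taking my word for it:

```c
/* A comparison in C already yields 0 or 1, so a capable compiler can
   materialize it directly (cmp + setl on x86) with no jump. */
int less_than(int a, int b)
{
    return a < b;
}

/* Branch-free selection: gcc typically emits cmov here; a weaker
   compiler emits a compare-and-jump instead. */
int min_int(int a, int b)
{
    return a < b ? a : b;
}
```

Compile with gcc -O2 -S (or disassemble with objdump -d) and look for setl/cmovl versus a conditional jump.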
Beyond this Cassandra-like warning, it's hard to make any helpful general statements. So here are some general statements that may be unhelpful:
Modern branch-prediction hardware is terrifyingly good. If you could find a real program where bad branch prediction costs more than a 1–2% slowdown, I'd be very surprised.
Performance counters or other tools that tell you where to find branch mispredictions are indispensable.
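On Linux, for example, perf will both count mispredictions and attribute them to hot spots (./myprog is a placeholder; exact event names vary by CPU):

```
perf stat -e branches,branch-misses ./myprog    # overall miss rate
perf record -e branch-misses ./myprog           # sample misprediction sites
perf report                                     # see which code they land in
```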
If you actually need to improve such code, I'd look into trace scheduling and loop unrolling:
- Loop unrolling replicates loop bodies and gives your optimizer more control flow to work with (there's a sketch after this list).
- Trace scheduling identifies which paths are most likely to be taken, and among other tricks, it can tweak branch directions so that the branch-prediction hardware works better on the most common paths. With unrolled loops, there are more and longer paths, so the trace scheduler has more to work with.
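To make the unrolling point concrete, here's a minimal C sketch (the function and names are mine, purely illustrative): counting elements below a threshold, unrolled four ways. Each comparison yields 0 or 1, so a compiler that can compute < without control flow keeps the whole body branch-free.

```c
#include <stddef.h>

/* Count how many elements fall below a threshold. The unrolled body
   hands the optimizer four independent, branch-free comparisons per
   iteration instead of one hard-to-predict branch. */
size_t count_below(const int *a, size_t n, int threshold)
{
    size_t count = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* (a[i] < threshold) is 0 or 1; gcc on x86 can compute it
           with setl rather than a jump. */
        count += (a[i]     < threshold);
        count += (a[i + 1] < threshold);
        count += (a[i + 2] < threshold);
        count += (a[i + 3] < threshold);
    }
    for (; i < n; i++)   /* remainder loop for n not divisible by 4 */
        count += (a[i] < threshold);
    return count;
}
```

Whether this wins anything depends on the data: if the original branch was predictable, the rewrite buys you nothing.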
I'd be leery of trying to code this myself in assembly. When the next chip comes out with new branch-prediction hardware, chances are excellent that all your hard work goes down the drain. Instead I'd look for a feedback-directed optimizing compiler.
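With gcc, for instance, the feedback loop is two compiles and a training run (file names here are placeholders):

```
gcc -O2 -fprofile-generate hot.c -o hot
./hot < representative-input.txt     # writes .gcda profile data
gcc -O2 -fprofile-use hot.c -o hot   # optimizer now knows the branch odds
```

The better the training input resembles the real workload, the better the compiler can arrange the common paths.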