Sometimes a loop where the CPU spends most of the time has some branch prediction miss (misprediction) very often (near .5 probability.) I\'ve seen a few techniques on very isol
GCC is already smart enough to replace conditionals with simpler instructions. For example newer Intel processors provide cmov (conditional move). If you can use it, SSE2 provides some instructions to compare 4 integers (or 8 shorts, or 16 chars) at a time.
Additionaly to compute minimum you can use (see these magic tricks):
min(x, y) = x+(((y-x)>>(WORDBITS-1))&(y-x))
However, pay attention to things like:
c[i][j] = min(c[i][j], c[i][k] + c[j][k]); // from Floyd-Warshal algorithm
even no jumps are implied is much slower than
int tmp = c[i][k] + c[j][k];
if (tmp < c[i][j])
c[i][j] = tmp;
My best guess is that in the first snippet you pollute the cache more often, while in the second you don't.