I will echo what others have said: a better algorithm is going to win when it comes to performance gains.
That said, I work in image processing, which as a problem domain can be stickier. For example, many years ago I had a chunk of code that looked like this:
void FlipBuffer(unsigned char *start, unsigned char *end)
{
    unsigned char temp;

    while (start <= end) {       /* walk in from both ends, bit-reversing and swapping */
        temp = _bitRev[*start];
        *start++ = _bitRev[*end];
        *end-- = temp;
    }
}
which rotates a 1-bit frame buffer 180 degrees. _bitRev is a 256-byte table of reversed bits. This code is about as tight as you can get it. It ran on an 8 MHz 68K laser printer controller and took roughly 2.5 seconds for a legal-sized piece of paper. To spare you the details, the customer couldn't bear 2.5 seconds. The solution kept the same algorithm; the differences were that
- I used a 128K table and operated on words instead of bytes (the 68K is much happier working on words); a sketch of this, together with the blank-word skip, follows this list
- I used Duff's device to unroll the loop as much as would fit within a short branch (the second sketch below shows the shape of the technique)
- I put in an optimization to skip blank words
- I finally rewrote it in assembly to take advantage of the sobgtr instruction (subtract one and branch on greater) and to get "free" post-increments and pre-decrements in the right places.
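To make the word-table and blank-word ideas concrete, here is a rough sketch of that stage of the code, before the unrolling and the assembly. This is a reconstruction, not the original source: the names BuildBitRevTables, _bitRevWord and FlipBufferWords are invented for the example, the "skip blank words" test shown is just one plausible form of it, and it assumes the buffer starts on a word boundary and holds an even number of bytes.

unsigned char  _bitRev[256];        /* byte -> that byte with its bits reversed            */
unsigned short _bitRevWord[65536];  /* word -> that word with all 16 bits reversed (128K)  */

void BuildBitRevTables(void)
{
    long i;
    int bit;

    for (i = 0; i < 256; i++) {
        unsigned char b = (unsigned char)i, r = 0;
        for (bit = 0; bit < 8; bit++) {          /* reverse the 8 bits of b */
            r = (unsigned char)((r << 1) | (b & 1));
            b >>= 1;
        }
        _bitRev[i] = r;
    }
    /* reversing a 16-bit word = reversing each byte and swapping the two bytes */
    for (i = 0; i < 65536; i++)
        _bitRevWord[i] = (unsigned short)(((unsigned)_bitRev[i & 0xFF] << 8) | _bitRev[(i >> 8) & 0xFF]);
}

void FlipBufferWords(unsigned short *start, unsigned short *end)
{
    unsigned short temp;

    while (start <= end) {
        if (*start == 0 && *end == 0) {          /* both words blank: nothing to reverse or move */
            start++;
            end--;
            continue;
        }
        temp = _bitRevWord[*start];
        *start++ = _bitRevWord[*end];
        *end-- = temp;
    }
}

BuildBitRevTables would run once at startup; the 128K table trades memory for doing 16 bits of work per lookup instead of 8, and the blank-word test skips the table lookups and stores entirely on empty stretches of the page.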
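The Duff's device part is easier to show on the original byte version (in the real code it was combined with the word table). The unroll factor here is assumed to be 8 purely for illustration; the actual factor was whatever fit in a short branch.

extern unsigned char _bitRev[256];   /* the 256-byte reversed-bit table from above */

void FlipBufferUnrolled(unsigned char *start, unsigned char *end)
{
    unsigned char temp;
    long steps, groups;

    if (start > end)
        return;

    steps  = (end - start) / 2 + 1;  /* swap steps the plain loop would execute */
    groups = (steps + 7) / 8;        /* times the unrolled body is entered      */

    switch (steps % 8) {             /* jump into the middle of the first pass  */
    case 0: do { temp = _bitRev[*start]; *start++ = _bitRev[*end]; *end-- = temp;
    case 7:      temp = _bitRev[*start]; *start++ = _bitRev[*end]; *end-- = temp;
    case 6:      temp = _bitRev[*start]; *start++ = _bitRev[*end]; *end-- = temp;
    case 5:      temp = _bitRev[*start]; *start++ = _bitRev[*end]; *end-- = temp;
    case 4:      temp = _bitRev[*start]; *start++ = _bitRev[*end]; *end-- = temp;
    case 3:      temp = _bitRev[*start]; *start++ = _bitRev[*end]; *end-- = temp;
    case 2:      temp = _bitRev[*start]; *start++ = _bitRev[*end]; *end-- = temp;
    case 1:      temp = _bitRev[*start]; *start++ = _bitRev[*end]; *end-- = temp;
            } while (--groups > 0);
    }
}

The gain is that the compare-and-branch at the bottom of the loop is paid once per eight swap steps instead of once per step.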
So: a 5x speedup, no algorithm change.
The point is that you also need to understand your problem domain and where your bottlenecks are. In image processing, the algorithm is still king, but if your loops are doing extra work, multiply that work by several million and that's the price you pay.