I have a simple function with an inner loop - it scales the input value, looks up an output value in a lookup table, and copies it to the destination. (ftol_ambient is a tri
How are you timing these routines? Paging or caching may be affecting the timings. It's possible that calling the first routine loads both routines into memory, crosses a page boundary, or pushes the stack into a page that isn't resident yet (causing a page-in), so that only the first routine pays the price.
You may want to run through both functions once before making the calls that take the measurements, to reduce the effects that virtual memory and caching might have.
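A minimal sketch of that warm-up idea, assuming the routine under test can be wrapped in a function that takes no arguments. The names `time_routine`, `sample_routine`, and `call_count` are made up for illustration and are not from the question; `clock()` measures CPU time and may be too coarse for a very short loop, so treat the numbers as rough.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the routine being measured. */
static volatile unsigned long call_count = 0;

static void sample_routine(void)
{
    ++call_count;
}

/*
 * Calls fn `warmup` times without timing it, so that code and data are
 * already paged in and cached, then returns the average CPU seconds per
 * call over `runs` timed calls.
 */
static double time_routine(void (*fn)(void), int warmup, int runs)
{
    assert(fn && warmup >= 0 && runs > 0);

    for (int i = 0; i < warmup; ++i)
        fn();

    clock_t start = clock();
    for (int i = 0; i < runs; ++i)
        fn();
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / runs;
}
```

Timing both versions this way, and then again in the opposite order, would show whether whichever routine runs first is the one paying a one-time cost.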
I once had a similar situation. I hoisted some code out of a loop to make it faster, but it got slower. Confusing. Turns out, the average number of times through the loop was less than 1.
The lesson (which you don't need, obviously) is that a change doesn't make your code faster unless you measure it actually running faster.
Are you just testing this inner loop, or are you testing your undisclosed outer loop as well? If so, look at these three lines:
if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))
++iSRGB;
*pDestination = (unsigned char) iSRGB;
Now, it looks like *pDestination is the counter for the outer loop. So by sometimes doing an extra increment of the iSRGB value, you get to skip some of the iterations in the outer loop, thereby reducing the total amount of work the code needs to do.
My guess is that in the first case two different branches end up in the same branch-prediction slot on the CPU. If these two branches go different ways each time, they keep overwriting each other's prediction and the code slows down.
In the second loop, the added code may just be enough to move one of the branches to a different branch prediction slot.
To be sure, you can give the Intel VTune analyzer or the AMD CodeAnalyst tool a try. These tools will show you exactly what's going on in your code.
However, keep in mind that it's most probably not worth optimizing this code further. If you tune your code to be faster on your CPU, it may at the same time become slower on a different brand.
EDIT:
If you want to read up on branch prediction, give Agner Fog's excellent web site a try: http://www.agner.org/optimize/
This pdf explains the branch-prediction slot allocation in detail: http://www.agner.org/optimize/microarchitecture.pdf
My first guess is that the branch is being predicted better in the second case, possibly because the nested if gives whatever algorithm the processor is using more information to guess from. Just out of curiosity, what happens when you remove the line
if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))
?
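Another way to probe the branch-prediction guess without changing the results is to write the conditional increment so it needs no branch at all. This is only a sketch: `step_branchy` and `step_branchless` are hypothetical names, and the arrays are plain `int` here rather than the pointer casts in the original. A compiler may already emit identical instructions for both forms, so it is worth checking the generated assembly before drawing conclusions.

```c
#include <assert.h>

/* Branching form, same shape as the line quoted above. */
static unsigned char step_branchy(const int *ceiling, int index, int value)
{
    assert(ceiling && index >= 0);
    if (ceiling[index] <= value)
        ++index;
    return (unsigned char)index;
}

/* Branch-free form: the comparison evaluates to 0 or 1 and is added directly. */
static unsigned char step_branchless(const int *ceiling, int index, int value)
{
    assert(ceiling && index >= 0);
    index += (ceiling[index] <= value);
    return (unsigned char)index;
}
```

If the timing gap between the two versions of the loop disappears with the branch-free form, that would point towards branch prediction rather than paging or caching.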