How can adding code to a loop make it faster?

允我心安 提交于 2019-11-27 14:58:46

My guess is, that in the first case two different branches end up in the same branch-prediction slot on the CPU. If these two branches predict different each time the code will slow down.

In the second loop, the added code may just be enough to move one of the branches to a different branch prediction slot.

To be sure you can give the Intel VTune analyzer or the AMD CodeAnalyst tool a try. These tools will show you what's exactly going on in your code.

However, keep in mind that it's most probably not worth to optimize this code further. If you tune your code to be faster on your CPU it may at the same time become slower on a different brand.


EDIT:

If you want to read on the branch-prediction give Agner Fog's excellent web-site a try: http://www.agner.org/optimize/

This pdf explains the branch-prediction slot allocation in detail: http://www.agner.org/optimize/microarchitecture.pdf

My first guess is that the branch is being predicted better in the second case. Possibly because the nested if gives whatever algorithm the processor's using more information to guess from. Just out of curiousity, what happens when you remove the line

if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))

?

How are you timing these routines? I wonder if paging or caching is having an effect on the timings? It's possible that calling the first routine loads both into memory, crosses a page boundary or causes the stack to cross into an invalid page (causing a page-in), but only the first routine pays the price.

You may want to to run through both functions once before making the calls that take the measurements to reduce the effects that virtual memory and caching might have.

Are you just testing this inner loop, or are you testing your undisclosed outer loop as well? If so, look at these three lines:

if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))  
    ++iSRGB;
*pDestination = (unsigned char) iSRGB;

Now, it looks like *pDestination is the counter for the outer loop. So by sometimes doing an extra increment of the iSRGB value you get to skip some of the iterations in the outer loop, thereby reducing the total amount of work the code needs to do.

I once had a similar situation. I hoisted some code out of a loop to make it faster, but it got slower. Confusing. Turns out, the average number of times though the loop was less than 1.

The lesson (which you don't need, obviously) is that a change doesn't make your code faster unless you measure it actually running faster.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!