Is GCC loop unrolling flag really effective?

时光取名叫无心 2021-01-31 05:01

In C, I have a task where I must do multiplication, inversion, transposition, addition, etc. with huge matrices allocated as 2-dimensional arrays (arrays of

3 answers
  •  死守一世寂寞
    2021-01-31 05:21

    Why unroll loops?

    Modern processors pipeline instructions. They like knowing what's coming next and make all sorts of fancy optimisations based on assumptions of which order the instructions should be executed.

    At the end of a loop though, there are two possibilities! Either you go back to the top, or continue on. The processor makes an educated guess on which is going to happen. If it gets it right, everything is good. If not, it has to flush the pipeline and stall for a bit while it prepares for taking the other branch.

    As you can imagine, unrolling a loop eliminates branches and the potential for those stalls, especially in cases where the odds are against a guess.

    Imagine a loop that executes 3 times, then continues. If you assume (as the processor probably would) that at the end of each pass you'll repeat the loop, then 2/3 of the time you'll be correct! The remaining 1/3 of the time, though, you'll stall.

    On the other hand, imagine the same situation, but the code loops 3000 times. Here the guess is wrong only once in 3000 passes, so unrolling probably gains very little.
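To make the idea concrete, here is a minimal sketch of what unrolling means at the source level. The function names `sum_rolled` and `sum_unrolled4` are made up for illustration, and an unroll factor of 4 is an arbitrary choice; a compiler that unrolls for you does something similar to the second function.

```c
#include <stddef.h>

/* Straightforward loop: one compare-and-branch per element. */
long sum_rolled(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* The same loop unrolled by hand four times: one compare-and-branch
 * per four elements, at the cost of a larger loop body. */
long sum_unrolled4(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

Both functions return the same result; the unrolled version simply reaches the loop-ending branch a quarter as often.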

    Why not unroll loops?

    Part of the processor fanciness mentioned above involves loading the instructions from the executable in memory into the processor's onboard instruction cache (shortened to I-cache). This holds a limited number of instructions which can be accessed quickly, but the processor may stall when new instructions need to be loaded from memory.

    Let's go back to the previous examples. Assume a reasonably small amount of code inside the loop takes up n bytes of I-cache. If we fully unroll the 3-iteration loop, it now takes up n * 3 bytes. A bit more, but it'll probably still fit in the I-cache just fine, so your cache will be working optimally and not needing to stall reading from main memory.

    The 3000-loop, however, unrolls to use a whopping n * 3000 bytes of I-cache. That's going to require several reads from memory, and probably push some other useful stuff from elsewhere in the program out of the I-cache.
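The arithmetic can be sketched in a few lines. The 32 KiB cache size below is an assumed figure chosen for illustration (check your own CPU's documentation), and the helper names are hypothetical.

```c
#include <stddef.h>

/* Assumed L1 instruction cache size, for illustration only. */
#define ICACHE_BYTES ((size_t)32 * 1024)

/* Code size of a fully unrolled loop: one copy of the body per iteration. */
size_t unrolled_code_bytes(size_t body_bytes, size_t trip_count)
{
    return body_bytes * trip_count;
}

/* Nonzero if the fully unrolled loop would fit in the assumed I-cache. */
int fits_in_icache(size_t body_bytes, size_t trip_count)
{
    return unrolled_code_bytes(body_bytes, trip_count) <= ICACHE_BYTES;
}
```

With a 20-byte loop body, 3 iterations unroll to 60 bytes, which fits easily; 3000 iterations unroll to 60000 bytes, which is already larger than the assumed cache.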

    So what do I do?

    As you can see, unrolling tends to provide more benefit for shorter loops, but can end up hurting performance if you're intending to loop a large number of times.

    Usually, a smart compiler will take a decent guess about which loops to unroll but you can force it if you're sure you know better. How do you get to know better? The only way is to try it both ways and compare timings!
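As a rough sketch of "try it both ways", the snippet below times the same loop with and without a forced unroll. It assumes GCC 8 or later, where `#pragma GCC unroll N` is available; the function names, the unroll factor of 8, and the use of `clock()` as the timer are illustrative choices, not a recommendation.

```c
#include <stddef.h>
#include <time.h>

/* Left to the compiler's own unrolling heuristics. */
void scale_default(double *a, size_t n, double k)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

/* Asks GCC to unroll this particular loop 8 times. */
void scale_unrolled(double *a, size_t n, double k)
{
#pragma GCC unroll 8
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

/* CPU seconds spent calling fn `repeats` times over the array.
 * Scaling by 1.0 keeps the values unchanged between runs. */
double seconds_for(void (*fn)(double *, size_t, double),
                   double *a, size_t n, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        fn(a, n, 1.0);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call `seconds_for(scale_default, ...)` and `seconds_for(scale_unrolled, ...)` on the same large array and compare the results. Building the whole program once with `gcc -O2` and once with `gcc -O2 -funroll-loops` gives a second comparison, this time for the global flag rather than a single loop. Run each measurement several times, since timings on a busy machine are noisy.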

    Premature optimization is the root of all evil -- Donald Knuth

    Profile first, optimise later.
