Is it realistic to use -O3 or -Ofast to compile your benchmark code or will it remove code?

猫巷女王i 2021-01-18 23:25

When compiling the benchmark code below with -O3, I was impressed by the difference it made in latency, so I began to wonder whether the compiler is not "cheatin

3 Answers
  • 2021-01-19 00:00

    You should always benchmark with optimizations turned on. However, it is important to make sure that the things you want to time do not get optimized away by the compiler.

    One way to do this is by printing out the calculation results after the timer has stopped:

    long long x = 0;
    
    for(int i = 0; i < iterations; i++) {
    
        long start = get_nano_ts(); // START clock
    
        for(int j = 0; j < load; j++) {
            if (i % 4 == 0) {
                x += (i % 4) * (i % 8);
            } else {
                x -= (i % 16) * (i % 32);
            }
        }
    
        long end = get_nano_ts(); // STOP clock
    
        // now print out x so the compiler doesn't just ignore it:
        std::cout << "check: " << x << '\n';
    
        // (omitted for clarity)
    }
    

    When comparing benchmarks for several different algorithms, the printed value can also serve as a check that each algorithm produces the same result.
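A minimal sketch of that idea, under some assumptions: get_nano_ts() is a stand-in built on std::chrono (the question's own version is not shown here), and sum_branchless, timed and compare_variants are hypothetical names introduced only for illustration. Each variant prints its check value after the clock has stopped, and the two values are compared.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <iostream>

// Assumed stand-in for the question's get_nano_ts(): monotonic time in ns.
inline std::int64_t get_nano_ts() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
               steady_clock::now().time_since_epoch()).count();
}

// Variant A: the loop as written in the question.
long long sum_branchy(int iterations, int load) {
    long long x = 0;
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < load; j++) {
            if (i % 4 == 0) {
                x += (i % 4) * (i % 8);
            } else {
                x -= (i % 16) * (i % 32);
            }
        }
    }
    return x;
}

// Variant B (hypothetical): same arithmetic without the branch.
// When i % 4 == 0 the original adds 0, so multiplying by the condition matches.
long long sum_branchless(int iterations, int load) {
    long long x = 0;
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < load; j++) {
            x -= (i % 4 != 0) * ((i % 16) * (i % 32));
        }
    }
    return x;
}

// Times one variant and prints its result only after the clock has stopped.
template <typename F>
long long timed(const char* name, F f, int iterations, int load) {
    std::int64_t start = get_nano_ts();  // START clock
    long long check = f(iterations, load);
    std::int64_t end = get_nano_ts();    // STOP clock
    std::cout << name << ": " << (end - start) << " ns, check: " << check << '\n';
    return check;
}

// Runs both variants and verifies that their check values agree.
long long compare_variants(int iterations, int load) {
    long long a = timed("branchy", sum_branchy, iterations, load);
    long long b = timed("branchless", sum_branchless, iterations, load);
    assert(a == b);
    return a;
}
```

Calling compare_variants(iterations, load) from main prints one line per variant; a mismatch in the check values would point to a bug in one variant rather than a real speed difference.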

  • 2021-01-19 00:03

    The compiler will certainly be "cheating" and removing unnecessary code when compiling with optimization enabled. It actually goes to great lengths to speed up your code, which almost always leads to impressive speed-ups. If it were somehow able to derive a formula that calculates the result in constant time instead of using this loop, it would. A constant factor of 15 is nothing out of the ordinary.

    But this does not mean that you should profile un-optimized builds! Indeed, when using languages like C and C++, the performance of un-optimized builds is pretty much completely meaningless. You need not worry about that at all.

    Of course, this can interfere with micro-benchmarks such as the one you showed above. Two points on that:

    1. More often than not, such micro-optimizations do not matter either. Prefer profiling your actual program and then removing bottlenecks.
    2. If you actually want such a micro benchmark, make it depend on some runtime input and display the result. That way, the compiler cannot remove the functionality itself, just make it reasonably fast.
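Point 2 could look roughly like the following sketch. The names load_from_args, run_workload and bench_entry, the default of 1000 and the upper bound of 1000000 are all assumptions made up for this example. The loop count comes from the command line, so it is not a compile-time constant, and the result is printed so the work cannot be discarded.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>

// Hypothetical helper: reads the inner-loop count from argv[1].
// Falls back to a default when the argument is missing or not a sane number.
int load_from_args(int argc, char** argv, int fallback = 1000) {
    if (argc < 2) return fallback;
    char* end = nullptr;
    long value = std::strtol(argv[1], &end, 10);
    if (end == argv[1] || *end != '\0' || value < 0 || value > 1000000) {
        return fallback;
    }
    return static_cast<int>(value);
}

// The workload from the question, with the loop bounds as parameters.
long long run_workload(int iterations, int load) {
    assert(iterations >= 0 && load >= 0);
    long long x = 0;
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < load; j++) {
            if (i % 4 == 0) {
                x += (i % 4) * (i % 8);
            } else {
                x -= (i % 16) * (i % 32);
            }
        }
    }
    return x;
}

// Meant to be called as `return bench_entry(argc, argv);` from main.
int bench_entry(int argc, char** argv) {
    int load = load_from_args(argc, argv);  // runtime input
    long long x = run_workload(1000, load);
    std::cout << "check: " << x << '\n';    // result is observable
    return 0;
}
```

The compiler is still free to simplify the loop, but it has to produce the right value for whatever load is passed in at run time.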

    Since you seem to be doing that, the code you show has a good chance of being a reasonable micro-benchmark. One thing you should watch out for is whether your compiler moves both calls to get_nano_ts() to the same side of the loop. It is allowed to do this, since "run time" does not count as an observable side effect. (The standard does not even mandate that your machine operates at finite speed.) It was argued here that this usually is not a problem, though I cannot really judge whether the answer given is valid or not.
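One guard that is commonly used against this kind of reordering is an empty inline asm statement that claims to modify the value and to touch memory. The sketch below assumes GCC or Clang, because the asm syntax is a compiler extension and not standard C++. keep_alive and timed_workload are hypothetical names, and this is a practical convention rather than anything the standard guarantees.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper (GCC/Clang extension, not standard C++).
// The empty asm pretends to read and write `value` and to clobber memory,
// so the compiler has to have `value` ready at this exact point.
inline void keep_alive(long long& value) {
    asm volatile("" : "+m"(value) : : "memory");
}

// Runs the inner loop from the question for one outer index `i` and
// reports the elapsed time through `elapsed_ns`.
long long timed_workload(int i, int load, long long x, std::int64_t& elapsed_ns) {
    assert(load >= 0);
    using clock = std::chrono::steady_clock;

    auto start = clock::now();  // START clock
    keep_alive(x);              // the work below depends on x as seen here

    for (int j = 0; j < load; j++) {
        if (i % 4 == 0) {
            x += (i % 4) * (i % 8);
        } else {
            x -= (i % 16) * (i % 32);
        }
    }

    keep_alive(x);              // x must be fully computed before this point
    auto end = clock::now();    // STOP clock

    elapsed_ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    return x;
}
```

This only pins the work between the two clock reads. The optimizer may still simplify the loop itself, so checking the generated assembly remains worthwhile.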

    If your program does not do anything expensive other than the thing you want to benchmark (which, if possible, it should not do anyway), you can also move the time measurement "outside", e.g. by running it under the time command.

  • 2021-01-19 00:16

    It can be very difficult to benchmark what you think you are measuring. In the case of the inner loop:

    for (int j = 0;  j < load;  ++j)
            if (i % 4 == 0)
                    x += (i % 4) * (i % 8);
            else    x -= (i % 16) * (i % 32);
    

    A shrewd compiler might be able to see through that and change the code to something like:

     x = load * 174;   // example only
    

    I know that isn't equivalent, but there is some fairly simple expression which can replace that loop, because neither branch depends on j.
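To make that concrete: for a fixed i, every pass of the inner loop changes x by the same amount, so the whole loop amounts to adding load times that amount. When i % 4 == 0 the amount is (i % 4) * (i % 8) = 0, otherwise it is -(i % 16) * (i % 32). The sketch below spells this out; delta, inner_closed_form and check_forms are hypothetical names, and this is only one rewrite a compiler could arrive at, not a claim about what any particular compiler emits.

```cpp
#include <cassert>

// The inner loop from the question, for a fixed outer index i.
long long inner_loop(long long x, int i, int load) {
    for (int j = 0; j < load; ++j) {
        if (i % 4 == 0) {
            x += (i % 4) * (i % 8);
        } else {
            x -= (i % 16) * (i % 32);
        }
    }
    return x;
}

// Change applied to x by a single pass of the inner loop.
constexpr int delta(int i) {
    return (i % 4 == 0) ? 0 : -((i % 16) * (i % 32));
}

// Closed form: the loop body runs `load` times (zero times if load <= 0).
long long inner_closed_form(long long x, int i, int load) {
    return load > 0 ? x + static_cast<long long>(load) * delta(i) : x;
}

// Brute-force comparison of both forms over a small range of inputs.
void check_forms(int max_i, int max_load) {
    for (int i = 0; i < max_i; ++i) {
        for (int load = 0; load <= max_load; ++load) {
            assert(inner_loop(0, i, load) == inner_closed_form(0, i, load));
        }
    }
}
```

With this rewrite the cost of the inner loop no longer grows with load, which is exactly the kind of change that makes a timing result look too good to be true.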

    The way to be sure is to use the gcc -S compiler option and look at the assembly code it generates.
