Why does this C++ code execute so slowly compared to Java?

醉话见心 2021-02-04 15:55

I recently wrote a computation-intensive algorithm in Java and then translated it to C++. To my surprise, the C++ version ran considerably slower. I have now written a much shorter

2 answers
  •  长情又很酷
    2021-02-04 16:32

    Somehow both GCC and clang fail to unroll this loop and pull out the invariants, even at -O3 and -Os, but Java does.

    Java's final JITted assembly code is similar to this (in reality repeated twice):

        while (true) {
            loopCount++;
            if (++intArray[i++] >= FINISH_TRIGGER) break;
            loopCount++;
            if (++intArray[i++] >= FINISH_TRIGGER) break;
            loopCount++;
            if (++intArray[i++] >= FINISH_TRIGGER) break;
            loopCount++;
            if (++intArray[i++] >= FINISH_TRIGGER) { if (i >= ARRAY_LENGTH) i = 0; break; }
            if (i >= ARRAY_LENGTH) i = 0;
        }
    

    With this loop I get exactly the same timing (6.4s) in C++ and Java.

    Why is this legal to do? Because ARRAY_LENGTH is 100, which is a multiple of 4, i can only reach ARRAY_LENGTH (and need to be reset to 0) after every fourth iteration. The wrap-around check is therefore needed only once per group of four.
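The equivalence argument above can be checked with a small sketch. The baseline loop below is an assumption (the question's original loop is not shown here), and FINISH_TRIGGER is set to a small illustrative value so the check finishes instantly; only ARRAY_LENGTH = 100 comes from the answer.

```cpp
#include <array>
#include <cstdint>

constexpr int ARRAY_LENGTH = 100;     // value stated in the answer
constexpr int FINISH_TRIGGER = 1000;  // assumption: small value for a quick check

// The unrolled form is only valid when the wrap-around point
// falls on a group-of-four boundary.
static_assert(ARRAY_LENGTH % 4 == 0, "unroll-by-4 needs ARRAY_LENGTH divisible by 4");

// Assumed baseline: one increment and one wrap-around check per iteration.
std::int64_t run_baseline() {
    std::array<int, ARRAY_LENGTH> intArray{};
    int i = 0;
    std::int64_t loopCount = 0;
    while (true) {
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) break;
        if (i >= ARRAY_LENGTH) i = 0;
    }
    return loopCount;
}

// Unrolled by four, as in the answer: the wrap-around check
// runs only after the fourth increment of each group.
std::int64_t run_unrolled() {
    std::array<int, ARRAY_LENGTH> intArray{};
    int i = 0;
    std::int64_t loopCount = 0;
    while (true) {
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) break;
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) break;
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) break;
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) { if (i >= ARRAY_LENGTH) i = 0; break; }
        if (i >= ARRAY_LENGTH) i = 0;
    }
    return loopCount;
}
```

Both functions should return the same count. If ARRAY_LENGTH were not a multiple of 4, the unrolled version would index past the end of the array, which is why the `static_assert` is there.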

    This looks like an opportunity for improvement for GCC and clang: they fail to unroll loops whose total number of iterations is unknown, and even when unrolling is forced, they fail to recognize parts of the loop that apply only to certain iterations.

    Regarding your findings in more complex code (a.k.a. real life): Java's optimizer is exceptionally good with small loops, and a lot of thought has been put into that, but Java loses a lot of time on virtual calls and GC.

    In the end it comes down to machine instructions running on a concrete architecture; whoever comes up with the best set wins. Don't assume the compiler will "do the right thing": look at the generated code, profile, repeat.
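As a rough illustration of the "profile, repeat" step, here is a minimal timing sketch using `std::chrono::steady_clock`. The helper `time_ms` and the function `toy_workload` are hypothetical names invented for this example, and the workload is an assumed stand-in for the loop under discussion, not the asker's code. To look at the generated code, compiling with `-O3 -S` makes GCC or clang emit assembly instead of an object file.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in workload: round-robin increments over 100 counters
// until one of them reaches finish_trigger. Returns the number of increments.
std::int64_t toy_workload(int finish_trigger) {
    std::vector<int> intArray(100, 0);
    std::size_t i = 0;
    std::int64_t loopCount = 0;
    while (true) {
        loopCount++;
        if (++intArray[i++] >= finish_trigger) break;
        if (i >= intArray.size()) i = 0;
    }
    return loopCount;
}

// Hypothetical helper: runs fn once, stores its return value in `result`
// (so the optimizer cannot discard the work), and returns the elapsed
// wall-clock time in milliseconds.
template <typename Fn>
double time_ms(Fn&& fn, std::int64_t& result) {
    const auto start = std::chrono::steady_clock::now();
    result = fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as `time_ms([] { return toy_workload(1000000); }, result)` gives one sample. A single run is noisy, so in practice it helps to repeat the measurement several times and compare against the assembly before drawing conclusions.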

    For example, if you restructure your loop just a bit:

        while (!finished) {
            for (i=0; i<ARRAY_LENGTH; i++) {
                loopCount++;
                if (++intArray[i] >= FINISH_TRIGGER) {
                    finished=true;
                    break;
                }
            }
        }
    

    Then C++ outperforms Java (5.9s vs 6.4s).

    And if you can allow a slight overrun (incrementing more intArray elements after the exit condition is reached):

        while (!finished) {
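            for (int i=0; i<ARRAY_LENGTH; i++) {
                ++intArray[i];
            }
            loopCount+=ARRAY_LENGTH;
            for (int i=0; i<ARRAY_LENGTH; i++) {
                if (intArray[i] >= FINISH_TRIGGER) {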
                    loopCount-=ARRAY_LENGTH-i-1;
                    finished=true;
                    break;
                }
            }
        }
    

    Now clang is able to vectorize the loop and reaches 3.5s vs. Java's 4.8s (GCC, unfortunately, is still unable to vectorize it).
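The overrun variant only makes sense if the `loopCount` correction gives the same count as an exact loop. The sketch below checks that under stated assumptions: the exact loop is an assumed baseline, the overrun loop is one plausible reading of the approach described above (increment the whole array, then scan for the first element that hit the trigger), and `seeded()` is a hypothetical helper that gives one element a head start so the exit happens mid-array. FINISH_TRIGGER is a small illustrative value.

```cpp
#include <array>
#include <cstdint>

constexpr int ARRAY_LENGTH = 100;     // value stated in the answer
constexpr int FINISH_TRIGGER = 1000;  // assumption: small value for a quick check

using Counters = std::array<int, ARRAY_LENGTH>;

// Hypothetical starting state: element 37 starts at 5, so it reaches
// the trigger first and the loop exits in the middle of a pass.
Counters seeded() {
    Counters a{};
    a[37] = 5;
    return a;
}

// Assumed exact loop: stops on the very increment that hits the trigger.
std::int64_t count_exact(Counters intArray) {
    int i = 0;
    std::int64_t loopCount = 0;
    while (true) {
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) break;
        if (i >= ARRAY_LENGTH) i = 0;
    }
    return loopCount;
}

// Overrun variant: the first inner loop has no early exit, which makes it
// a simple candidate for vectorization. The elements after the triggering
// index get one extra increment, and loopCount is corrected afterwards.
std::int64_t count_with_overrun(Counters intArray) {
    std::int64_t loopCount = 0;
    bool finished = false;
    while (!finished) {
        for (int i = 0; i < ARRAY_LENGTH; i++) {
            ++intArray[i];
        }
        loopCount += ARRAY_LENGTH;
        for (int i = 0; i < ARRAY_LENGTH; i++) {
            if (intArray[i] >= FINISH_TRIGGER) {
                loopCount -= ARRAY_LENGTH - i - 1;  // undo the overrun
                finished = true;
                break;
            }
        }
    }
    return loopCount;
}
```

The two functions agree on `loopCount`, but not on the final array contents: the overrun version leaves elements 38 through 99 one increment higher. Whether that is acceptable depends on what the surrounding code does with the array.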
