How could this Java code be sped up?


First of all, -O3 enables:

-finline-functions
-ftree-vectorize

among others...

So it looks like it actually might be vectorizing.

EDIT: This has been confirmed (see comments). The C++ version is indeed being vectorized by the compiler. With vectorization disabled, the C++ version actually runs a bit slower than the Java version.

Assuming the JIT does not vectorize the loop, it may be difficult/impossible for the Java version to match the speed of the C++ version.


Now, if I were a smart C/C++ compiler, here's how I would arrange that loop (on x64):

int c1 = (a1 + a2) ^ a2;
int c2 = (b1 - b2) << 4;

// Four independent accumulators break the dependency chain,
// so the four XOR streams can run in parallel (or be packed into SIMD lanes).
int tmp0 = c1;
int tmp1 = 0;
int tmp2 = 0;
int tmp3 = 0;

int z0 = 0;
int z1 = 1;
int z2 = 2;
int z3 = 3;

// Process four values of z per iteration (z = 0..99 in steps of 4).
do{
    tmp0 ^= z0 + c2;
    tmp1 ^= z1 + c2;
    tmp2 ^= z2 + c2;
    tmp3 ^= z3 + c2;
    z0 += 4;
    z1 += 4;
    z2 += 4;
    z3 += 4;
}while (z0 < 100);

// XOR is associative and commutative, so folding the accumulators
// back together gives the same result as the original sequential loop.
tmp0 ^= tmp1;
tmp2 ^= tmp3;

tmp0 ^= tmp2;

return tmp0;

Note that this loop is completely vectorizable.

Even better, I would completely unroll this loop. These are things that a C/C++ compiler will do. But now the question is: will the JIT do it?
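If the JIT doesn't, you can apply the same unrolling by hand on the Java side. A minimal sketch, with the body of calc() taken from the inlined version quoted further down (the four-int signature is assumed from the question):

    // Hand-unrolled calc(): four independent accumulators break the
    // dependency chain, so the iterations can execute in parallel.
    static int calc(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;

        int tmp0 = c1, tmp1 = 0, tmp2 = 0, tmp3 = 0;
        for (int z = 0; z < 100; z += 4) {
            tmp0 ^= z + c2;
            tmp1 ^= (z + 1) + c2;
            tmp2 ^= (z + 2) + c2;
            tmp3 ^= (z + 3) + c2;
        }

        // XOR is associative and commutative, so folding the accumulators
        // back together yields the same result as the sequential loop.
        return tmp0 ^ tmp1 ^ tmp2 ^ tmp3;
    }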

Use the Hotspot JVM in server mode, and make sure to warm it up. Also give enough time for the garbage collection algorithms to settle down to a stable pace if collection is a major part of your test. I don't see anything at a glance that makes me think it would be...
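A rough, self-contained harness of the kind I mean (the class and method names here are placeholders, and the calc() body is copied from the inlined version quoted further down): do a few throwaway runs first so the JIT has compiled the hot methods before you start the clock.

    // Hypothetical benchmark harness; Bench/compute are illustrative names only.
    public class Bench {
        static int calc(int a1, int a2, int b1, int b2) {
            int c1 = (a1 + a2) ^ a2;
            int c2 = (b1 - b2) << 4;
            for (int z = 0; z < 100; z++) c1 ^= z + c2;
            return c1;
        }

        // Stand-in for the real workload; plug the parsed file data in here.
        static int compute(int[] list) {
            int rs = 0;
            for (int k = 0; k + 3 < list.length; k += 4) {
                rs = calc(list[k], list[k + 1], list[k + 2], list[k + 3]);
            }
            return rs;
        }

        public static void main(String[] args) {
            int[] list = new int[4_000_000];              // dummy data for illustration
            for (int i = 0; i < list.length; i++) list[i] = i;

            for (int i = 0; i < 10; i++) compute(list);   // warm-up: let the JIT kick in

            long start = System.nanoTime();
            int rs = compute(list);
            System.out.println("rs=" + rs + " in "
                    + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }

Run it with java -server Bench (server mode is already the default on most 64-bit JVMs).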

Interesting question. :-) This is probably more of a comment since I won't really answer your question, but it's too long for the comment box.

Micro-benchmarking in Java is tricky because the JIT can go nuts with optimizations. But this particular code tricks the JIT in such a way that it somehow cannot perform its normal optimizations.

Normally, this code would run in O(1) time because your main loop has no effect on anything:

    for (int k = 0; k < list.length; k++) {
        rs = calc(list[k++], list[k++], list[k++], list[k]);
    }

Note that the final result of rs doesn't really depend on running all iterations of the loop; just the last one. You can calculate the final value of "k" for the loop without having to actually run the loop. Normally the JIT would notice that and turn your loop into a single assignment, if it's able to detect that the function being called (calc) has no side effects (which it doesn't).
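In other words, the whole loop could in principle collapse to something like this (just a sketch; it assumes list.length is a multiple of 4, which is what the chunked indexing implies):

    // Only the last iteration of the outer loop determines rs.
    int n = list.length;
    rs = calc(list[n - 4], list[n - 3], list[n - 2], list[n - 1]);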

But, somehow, this statement in the calc() function messes up the JIT:

        c1 ^= z + c2;

Somehow that adds too much complexity for the JIT to decide that all this code in the end doesn't change anything and that the original loop can be optimized out.

If you change that particular statement to something even more pointless, like:

        c1 = z + c2;

Then the JIT picks things up and optimizes your loops away. Try it out. :-)
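For reference, the two variants of calc() I compared look roughly like this (the body is reconstructed from the inlined version shown further down):

    // Variant with "^=": the JIT leaves the loops in place.
    static int calcXor(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) c1 ^= z + c2;
        return c1;
    }

    // Variant with "=": the JIT optimizes the loops away entirely.
    static int calcAssign(int a1, int a2, int b1, int b2) {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) c1 = z + c2;
        return c1;
    }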

I tried locally with a much smaller data set: with the "^=" version the calculations took ~1.6 seconds, while with the "=" version they took 0.007 seconds (in other words, it optimized the loop away).

As I said, not really a response, but I thought this might be interesting.

Did you try "inlining" parse() and calc(), i.e. putting all the code in main()?

What is the score if you move the few lines of your calc function inside your list iteration?
I know it's not very clean, but you'll save the overhead of the method calls.

[...]
    for (int k = 0; k < list.length; k++) {
        int a1 = list[k++];
        int a2 = list[k++];
        int b1 = list[k++];
        int b2 = list[k];

        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;

        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }

        rs = c1;
    }

The MappedByteBuffer is only contributing about 20% to I/O performance, and it comes at an enormous memory cost; if it causes swapping, the cure is worse than the disease.

I would use a BufferedReader around a FileReader, and maybe a Scanner around that to get the integers, or at least Integer.parseInt(), which is a lot more likely to have been warmed up by HotSpot than your own radix conversion code.
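A minimal sketch of that approach (the one-integer-per-line format is an assumption, since the exact file layout isn't shown in the question):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Reads one integer per line using the JDK's own parsing, which is far
    // more likely to be JIT-warmed than a hand-rolled radix conversion.
    public class ReadInts {
        static int[] readFile(String path) throws IOException {
            List<Integer> values = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (!line.isEmpty()) {
                        values.add(Integer.parseInt(line));
                    }
                }
            }
            int[] result = new int[values.size()];
            for (int i = 0; i < result.length; i++) {
                result[i] = values.get(i);
            }
            return result;
        }
    }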

I am trying to benchmark how fast can Java do a simple task: read a huge file into memory and then perform some meaningless calculations on the data.

If the task is to do a meaningless calculation, then the best optimization is to not do the calculation.

If what you are really trying to do here is to figure out whether there is a general technique to make a computation go faster, then I think you are barking up the wrong tree. There is no such technique. What you learn from optimizing a meaningless calculation is not likely to apply to other (hopefully meaningful) calculations.

If the calculation is not meaningless, and the aim is to make the whole program go faster, you've probably already reached the point where optimization is a waste of time.

  • Current (Java) - 26.50s + 11.27s = ~38 seconds
  • Goal (C++) - ~26.5s + 4.50s = ~31 seconds
  • Percentage speedup - less than 20%

A speedup of less than 20% for a ~40 second computation is probably not worth the effort. It is cheaper to get the user to twiddle his thumbs for those extra 7 seconds.


This is also telling you something interesting: in this scenario, it doesn't make much difference in relative terms whether you use C++ or Java. The overall performance of the program is dominated by a phase in which C++ and Java are comparable.
