Java for-loop optimization

无人共我 2020-12-06 21:25

I ran some runtime tests with Java for loops and noticed some strange behaviour. For my code I need wrapper objects for primitive types like int, double and so on, to simul

1 answer
  • 2020-12-06 21:55

    It's so easy to get fooled by hand-made microbenchmarks - you never know what they actually measure. That's why there are special tools like JMH. But let's analyze what happens to the primitive hand-made benchmark:

    public class Test {
        // Mutable holder for a double, standing in for a wrapper object
        static class HDouble {
            double value;
        }

        public static void main(String[] args) {
            primitive();
            wrapper();
        }

        // Counts to 1e9 with a local primitive double as the loop variable
        public static void primitive() {
            long start = System.nanoTime();
            for (double d = 0; d < 1000000000; d++) {
            }
            long end = System.nanoTime();
            System.out.printf("Primitive: %.3f s\n", (end - start) / 1e9);
        }

        // Same loop, but the counter lives in a field of a heap object
        public static void wrapper() {
            HDouble d = new HDouble();
            long start = System.nanoTime();
            for (d.value = 0; d.value < 1000000000; d.value++) {
            }
            long end = System.nanoTime();
            System.out.printf("Wrapper:   %.3f s\n", (end - start) / 1e9);
        }
    }
    

    The results are somewhat similar to yours:

    Primitive: 3.618 s
    Wrapper:   1.380 s
    

    Now repeat the test several times:

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            primitive();
            wrapper();
        }
    }
    

    It gets more interesting:

    Primitive: 3.661 s
    Wrapper:   1.382 s
    Primitive: 3.461 s
    Wrapper:   1.380 s
    Primitive: 1.376 s <-- starting from 3rd iteration
    Wrapper:   1.381 s <-- the timings become equal
    Primitive: 1.371 s
    Wrapper:   1.372 s
    Primitive: 1.379 s
    Wrapper:   1.378 s
    

    It looks like both methods eventually got optimized. Run it once more, this time logging JIT compiler activity: -XX:-TieredCompilation -XX:CompileOnly=Test -XX:+PrintCompilation

        136    1 %           Test::primitive @ 6 (53 bytes)
       3725    1 %           Test::primitive @ -2 (53 bytes)   made not entrant
    Primitive: 3.589 s
       3748    2 %           Test::wrapper @ 17 (73 bytes)
       5122    2 %           Test::wrapper @ -2 (73 bytes)   made not entrant
    Wrapper:   1.374 s
       5122    3             Test::primitive (53 bytes)
       5124    4 %           Test::primitive @ 6 (53 bytes)
    Primitive: 3.421 s
       8544    5             Test::wrapper (73 bytes)
       8547    6 %           Test::wrapper @ 17 (73 bytes)
    Wrapper:   1.378 s
    Primitive: 1.372 s
    Wrapper:   1.375 s
    Primitive: 1.378 s
    Wrapper:   1.373 s
    Primitive: 1.375 s
    Wrapper:   1.378 s
    

    Note the % sign in the compilation log during the first iteration. It means that the methods were compiled in OSR (on-stack replacement) mode. During the second iteration the methods were recompiled in normal mode. From the third iteration on, there was no difference in execution speed between primitive and wrapper.
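
    To make the reading of the log concrete, here is a small sketch that classifies lines of that shape as OSR or normal compilations. It assumes only the line layout visible in the log above (timestamp, compile id, optional %, method, optional "@ bci", size, optional "made not entrant"); the class name CompilationLogDemo and the method describe are made up for illustration, and real PrintCompilation output can contain further flag columns that this pattern does not handle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompilationLogDemo {
    // Assumed line shape, taken only from the log shown above:
    // <timestamp> <compile id> [%] <Class::method> [@ <bci>] (<n> bytes) [made not entrant]
    private static final Pattern LINE = Pattern.compile(
        "^\\s*(\\d+)\\s+(\\d+)\\s+(%)?\\s*(\\S+::\\S+)(?:\\s+@\\s+(-?\\d+))?"
        + "\\s+\\((\\d+) bytes\\)\\s*(made not entrant)?\\s*$");

    // Returns a short description of one log line (hypothetical helper).
    static String describe(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return "not a compilation event";
        }
        // The % column marks an OSR compilation
        String kind = (m.group(3) != null) ? "OSR" : "normal";
        String state = (m.group(7) != null) ? "made not entrant" : "compiled";
        return m.group(4) + ": " + kind + ", " + state;
    }

    public static void main(String[] args) {
        String[] lines = {
            "    136    1 %           Test::primitive @ 6 (53 bytes)",
            "   3725    1 %           Test::primitive @ -2 (53 bytes)   made not entrant",
            "Primitive: 3.589 s",
            "   5122    3             Test::primitive (53 bytes)",
        };
        for (String line : lines) {
            System.out.println(describe(line));
        }
    }
}
```

    Fed with the lines from the log above, the entries of the first iteration come out as OSR and compile id 3 comes out as a normal compilation, matching the description of the % sign.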

    What you've actually measured is the performance of an OSR stub. It is usually unrelated to the real performance of an application, and you shouldn't care much about it.

    But the question remains: why is the OSR stub for the wrapper compiled better than the one for the primitive variable? To find out, we need to look at the generated assembly code:
    -XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

    I'll omit all irrelevant code, leaving only the compiled loop.

    Primitive:

    0x00000000023e90d0: vmovsd 0x28(%rsp),%xmm1      <-- load double from the stack
    0x00000000023e90d6: vaddsd -0x7e(%rip),%xmm1,%xmm1
    0x00000000023e90de: test   %eax,-0x21f90e4(%rip)
    0x00000000023e90e4: vmovsd %xmm1,0x28(%rsp)      <-- store to the stack
    0x00000000023e90ea: vucomisd 0x28(%rsp),%xmm0    <-- compare with the stack value
    0x00000000023e90f0: ja     0x00000000023e90d0
    

    Wrapper:

    0x00000000023ebe90: vaddsd -0x78(%rip),%xmm0,%xmm0
    0x00000000023ebe98: vmovsd %xmm0,0x10(%rbx)      <-- store to the object field
    0x00000000023ebe9d: test   %eax,-0x21fbea3(%rip)
    0x00000000023ebea3: vucomisd %xmm0,%xmm1         <-- compare registers
    0x00000000023ebea7: ja     0x00000000023ebe90
    

    As you can see, the 'primitive' case performs several loads and stores to a stack location, while the 'wrapper' case mostly operates on registers. It is understandable why the OSR stub refers to the stack: in interpreted mode local variables are stored on the stack, and the OSR stub is made compatible with this interpreted frame. In the 'wrapper' case the value is stored on the heap, and the reference to the object is already cached in a register.
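
    Coming back to the point that the first iterations only measure the OSR stub: if you do stay with a hand-made test, one way to reduce that effect is to move the loop into its own method and call it many times before timing it, so the JIT gets a chance to compile the whole method in normal mode. A minimal sketch under that assumption; the class WarmupDemo, the helpers countTo and timeOnce, and the iteration counts are illustrative and not part of the original test, and this is still no substitute for JMH.

```java
class WarmupDemo {
    // Hypothetical helper: the loop from primitive(), with the bound passed in
    // and the counter returned so that the result of the loop is used.
    static double countTo(double limit) {
        double d = 0;
        while (d < limit) {
            d++;
        }
        return d;
    }

    // Times a single call of countTo and returns the elapsed seconds.
    static double timeOnce(double limit) {
        long start = System.nanoTime();
        double result = countTo(limit);
        long end = System.nanoTime();
        if (result != limit) {
            throw new IllegalStateException("unexpected result: " + result);
        }
        return (end - start) / 1e9;
    }

    public static void main(String[] args) {
        // Warm-up: many short calls, so the method is invoked often enough
        // to be compiled as a whole rather than only through OSR.
        // The count of 20,000 is an assumption, not a documented threshold.
        for (int i = 0; i < 20_000; i++) {
            countTo(1_000);
        }
        // Measurement: repeat a few times and compare the runs with each other.
        for (int i = 0; i < 5; i++) {
            System.out.printf("Run %d: %.3f s%n", i, timeOnce(1_000_000_000));
        }
    }
}
```

    Running it with -XX:+PrintCompilation, as above, shows whether countTo was compiled in normal mode before the timed runs start.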
