Why is 2 * (i * i) faster than 2 * i * i in Java?

一生所求 2020-12-22 14:43

The following Java program takes on average between 0.50 secs and 0.55 secs to run:

public static void main(String[] args) {
    long startTime = System.nanoTime();

    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }

    System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
    System.out.println("n = " + n);
}

If I replace 2 * (i * i) with 2 * i * i, it consistently takes longer to run. Why?

10 answers
  • 2020-12-22 14:44

    I got similar results:

    2 * (i * i): 0.458765943 s, n=119860736
    2 * i * i: 0.580255126 s, n=119860736
    

    I got the SAME results whether both loops were in the same program or each was in a separate .java file/.class executed in a separate run.

    Finally, here is a javap -c -v disassembly of each (run on the compiled .class):

         3: ldc           #3                  // String 2 * (i * i):
         5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
         8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
        11: lstore_1
        12: iconst_0
        13: istore_3
        14: iconst_0
        15: istore        4
        17: iload         4
        19: ldc           #6                  // int 1000000000
        21: if_icmpge     40
        24: iload_3
        25: iconst_2
        26: iload         4
        28: iload         4
        30: imul
        31: imul
        32: iadd
        33: istore_3
        34: iinc          4, 1
        37: goto          17
    

    vs.

         3: ldc           #3                  // String 2 * i * i:
         5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
         8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
        11: lstore_1
        12: iconst_0
        13: istore_3
        14: iconst_0
        15: istore        4
        17: iload         4
        19: ldc           #6                  // int 1000000000
        21: if_icmpge     40
        24: iload_3
        25: iconst_2
        26: iload         4
        28: imul
        29: iload         4
        31: imul
        32: iadd
        33: istore_3
        34: iinc          4, 1
        37: goto          17
    

    FYI -

    java -version
    java version "1.8.0_121"
    Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
    
  • 2020-12-22 14:45

    (Editor's note: this answer is contradicted by evidence from looking at the asm, as shown by another answer. This was a guess backed up by some experiments, but it turned out not to be correct.)


    When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:

    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += i * i;
    }
    n *= 2;
    

    but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the n += addition.
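
    As an aside, the grouping difference comes straight from Java's left-associative *. A minimal sketch (with an arbitrary i) to make that explicit:

    int i = 7;
    int a = 2 * (i * i);   // square first, then double
    int b = 2 * i * i;     // parses as (2 * i) * i: the constant binds to i first
    System.out.println(a + " " + b);   // 98 98

    Both forms always compute the same int value, because wrap-around multiplication is associative; only the evaluation order, and with it the code the JIT ends up generating, differs.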

    Here are a few reasons why I think this is the case:

    • Adding an if (n == 0) n = 1 statement at the start of the loop makes the two versions perform the same, since factoring out the multiplication no longer guarantees that the result will be the same
    • The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version

    Here is the test code that I used to draw these conclusions:

    public static void main(String[] args) {
        long fastVersion = 0;
        long slowVersion = 0;
        long optimizedVersion = 0;
        long modifiedFastVersion = 0;
        long modifiedSlowVersion = 0;
    
        for (int i = 0; i < 10; i++) {
            fastVersion += fastVersion();
            slowVersion += slowVersion();
            optimizedVersion += optimizedVersion();
            modifiedFastVersion += modifiedFastVersion();
            modifiedSlowVersion += modifiedSlowVersion();
        }
    
        System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
        System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
        System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
        System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
        System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
    }
    
    private static long fastVersion() {
        long startTime = System.nanoTime();
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * (i * i);
        }
        return System.nanoTime() - startTime;
    }
    
    private static long slowVersion() {
        long startTime = System.nanoTime();
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * i * i;
        }
        return System.nanoTime() - startTime;
    }
    
    private static long optimizedVersion() {
        long startTime = System.nanoTime();
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += i * i;
        }
        n *= 2;
        return System.nanoTime() - startTime;
    }
    
    private static long modifiedFastVersion() {
        long startTime = System.nanoTime();
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            if (n == 0) n = 1;
            n += 2 * (i * i);
        }
        return System.nanoTime() - startTime;
    }
    
    private static long modifiedSlowVersion() {
        long startTime = System.nanoTime();
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            if (n == 0) n = 1;
            n += 2 * i * i;
        }
        return System.nanoTime() - startTime;
    }
    

    And here are the results:

    Fast version: 5.7274411 s
    Slow version: 7.6190804 s
    Optimized version: 5.1348007 s
    Modified fast version: 7.1492705 s
    Modified slow version: 7.2952668 s
    
  • 2020-12-22 14:48

    There is a slight difference in the ordering of the bytecode.

    2 * (i * i):

         iconst_2
         iload_0
         iload_0
         imul
         imul
         iadd
    

    vs 2 * i * i:

         iconst_2
         iload_0
         imul
         iload_0
         imul
         iadd
    

    At first sight this should not make a difference; if anything the second version looks slightly better, since it uses one operand-stack slot less (its maximum stack depth is 2 instead of 3).

    So we need to dig deeper into the lower level (JIT)1.

    Remember that the JIT tends to unroll small loops very aggressively. Indeed, we observe a 16x unrolling for the 2 * (i * i) case:

    030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006
    030     addl    R11, RBP    # int
    033     movl    RBP, R13    # spill
    036     addl    RBP, #14    # int
    039     imull   RBP, RBP    # int
    03c     movl    R9, R13 # spill
    03f     addl    R9, #13 # int
    043     imull   R9, R9  # int
    047     sall    RBP, #1
    049     sall    R9, #1
    04c     movl    R8, R13 # spill
    04f     addl    R8, #15 # int
    053     movl    R10, R8 # spill
    056     movdl   XMM1, R8    # spill
    05b     imull   R10, R8 # int
    05f     movl    R8, R13 # spill
    062     addl    R8, #12 # int
    066     imull   R8, R8  # int
    06a     sall    R10, #1
    06d     movl    [rsp + #32], R10    # spill
    072     sall    R8, #1
    075     movl    RBX, R13    # spill
    078     addl    RBX, #11    # int
    07b     imull   RBX, RBX    # int
    07e     movl    RCX, R13    # spill
    081     addl    RCX, #10    # int
    084     imull   RCX, RCX    # int
    087     sall    RBX, #1
    089     sall    RCX, #1
    08b     movl    RDX, R13    # spill
    08e     addl    RDX, #8 # int
    091     imull   RDX, RDX    # int
    094     movl    RDI, R13    # spill
    097     addl    RDI, #7 # int
    09a     imull   RDI, RDI    # int
    09d     sall    RDX, #1
    09f     sall    RDI, #1
    0a1     movl    RAX, R13    # spill
    0a4     addl    RAX, #6 # int
    0a7     imull   RAX, RAX    # int
    0aa     movl    RSI, R13    # spill
    0ad     addl    RSI, #4 # int
    0b0     imull   RSI, RSI    # int
    0b3     sall    RAX, #1
    0b5     sall    RSI, #1
    0b7     movl    R10, R13    # spill
    0ba     addl    R10, #2 # int
    0be     imull   R10, R10    # int
    0c2     movl    R14, R13    # spill
    0c5     incl    R14 # int
    0c8     imull   R14, R14    # int
    0cc     sall    R10, #1
    0cf     sall    R14, #1
    0d2     addl    R14, R11    # int
    0d5     addl    R14, R10    # int
    0d8     movl    R10, R13    # spill
    0db     addl    R10, #3 # int
    0df     imull   R10, R10    # int
    0e3     movl    R11, R13    # spill
    0e6     addl    R11, #5 # int
    0ea     imull   R11, R11    # int
    0ee     sall    R10, #1
    0f1     addl    R10, R14    # int
    0f4     addl    R10, RSI    # int
    0f7     sall    R11, #1
    0fa     addl    R11, R10    # int
    0fd     addl    R11, RAX    # int
    100     addl    R11, RDI    # int
    103     addl    R11, RDX    # int
    106     movl    R10, R13    # spill
    109     addl    R10, #9 # int
    10d     imull   R10, R10    # int
    111     sall    R10, #1
    114     addl    R10, R11    # int
    117     addl    R10, RCX    # int
    11a     addl    R10, RBX    # int
    11d     addl    R10, R8 # int
    120     addl    R9, R10 # int
    123     addl    RBP, R9 # int
    126     addl    RBP, [RSP + #32 (32-bit)]   # int
    12a     addl    R13, #16    # int
    12e     movl    R11, R13    # spill
    131     imull   R11, R13    # int
    135     sall    R11, #1
    138     cmpl    R13, #999999985
    13f     jl     B2   # loop end  P=1.000000 C=6554623.000000
    

    We see that there is 1 register that is "spilled" onto the stack.
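
    To make that pattern easier to read, here is a rough Java-level sketch of what the 16x-unrolled body computes. This is only an illustration of the transformation (the method name is made up), not the code the JIT actually emits:

    // Sketch only: roughly what C2's 16x-unrolled 2 * (i * i) loop is doing.
    // Each term (i + k)^2 takes a single imull and is then doubled with a shift
    // (sall #1), so only one temporary per term has to stay live.
    static int fastUnrolledSketch() {
        int n = 0;
        int i = 0;
        for (; i < 1000000000 - 15; i += 16) {   // cmpl R13, #999999985 in the listing above
            for (int k = 0; k < 16; k++) {       // conceptually fully unrolled
                int t = i + k;
                n += (t * t) << 1;               // 2 * (t * t) becomes a shift
            }
        }
        for (; i < 1000000000; i++) {            // scalar tail
            n += 2 * (i * i);
        }
        return n;
    }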

    And for the 2 * i * i version:

    05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006
    05a     addl    RBX, R11    # int
    05d     movl    [rsp + #32], RBX    # spill
    061     movl    R11, R8 # spill
    064     addl    R11, #15    # int
    068     movl    [rsp + #36], R11    # spill
    06d     movl    R11, R8 # spill
    070     addl    R11, #14    # int
    074     movl    R10, R9 # spill
    077     addl    R10, #16    # int
    07b     movdl   XMM2, R10   # spill
    080     movl    RCX, R9 # spill
    083     addl    RCX, #14    # int
    086     movdl   XMM1, RCX   # spill
    08a     movl    R10, R9 # spill
    08d     addl    R10, #12    # int
    091     movdl   XMM4, R10   # spill
    096     movl    RCX, R9 # spill
    099     addl    RCX, #10    # int
    09c     movdl   XMM6, RCX   # spill
    0a0     movl    RBX, R9 # spill
    0a3     addl    RBX, #8 # int
    0a6     movl    RCX, R9 # spill
    0a9     addl    RCX, #6 # int
    0ac     movl    RDX, R9 # spill
    0af     addl    RDX, #4 # int
    0b2     addl    R9, #2  # int
    0b6     movl    R10, R14    # spill
    0b9     addl    R10, #22    # int
    0bd     movdl   XMM3, R10   # spill
    0c2     movl    RDI, R14    # spill
    0c5     addl    RDI, #20    # int
    0c8     movl    RAX, R14    # spill
    0cb     addl    RAX, #32    # int
    0ce     movl    RSI, R14    # spill
    0d1     addl    RSI, #18    # int
    0d4     movl    R13, R14    # spill
    0d7     addl    R13, #24    # int
    0db     movl    R10, R14    # spill
    0de     addl    R10, #26    # int
    0e2     movl    [rsp + #40], R10    # spill
    0e7     movl    RBP, R14    # spill
    0ea     addl    RBP, #28    # int
    0ed     imull   RBP, R11    # int
    0f1     addl    R14, #30    # int
    0f5     imull   R14, [RSP + #36 (32-bit)]   # int
    0fb     movl    R10, R8 # spill
    0fe     addl    R10, #11    # int
    102     movdl   R11, XMM3   # spill
    107     imull   R11, R10    # int
    10b     movl    [rsp + #44], R11    # spill
    110     movl    R10, R8 # spill
    113     addl    R10, #10    # int
    117     imull   RDI, R10    # int
    11b     movl    R11, R8 # spill
    11e     addl    R11, #8 # int
    122     movdl   R10, XMM2   # spill
    127     imull   R10, R11    # int
    12b     movl    [rsp + #48], R10    # spill
    130     movl    R10, R8 # spill
    133     addl    R10, #7 # int
    137     movdl   R11, XMM1   # spill
    13c     imull   R11, R10    # int
    140     movl    [rsp + #52], R11    # spill
    145     movl    R11, R8 # spill
    148     addl    R11, #6 # int
    14c     movdl   R10, XMM4   # spill
    151     imull   R10, R11    # int
    155     movl    [rsp + #56], R10    # spill
    15a     movl    R10, R8 # spill
    15d     addl    R10, #5 # int
    161     movdl   R11, XMM6   # spill
    166     imull   R11, R10    # int
    16a     movl    [rsp + #60], R11    # spill
    16f     movl    R11, R8 # spill
    172     addl    R11, #4 # int
    176     imull   RBX, R11    # int
    17a     movl    R11, R8 # spill
    17d     addl    R11, #3 # int
    181     imull   RCX, R11    # int
    185     movl    R10, R8 # spill
    188     addl    R10, #2 # int
    18c     imull   RDX, R10    # int
    190     movl    R11, R8 # spill
    193     incl    R11 # int
    196     imull   R9, R11 # int
    19a     addl    R9, [RSP + #32 (32-bit)]    # int
    19f     addl    R9, RDX # int
    1a2     addl    R9, RCX # int
    1a5     addl    R9, RBX # int
    1a8     addl    R9, [RSP + #60 (32-bit)]    # int
    1ad     addl    R9, [RSP + #56 (32-bit)]    # int
    1b2     addl    R9, [RSP + #52 (32-bit)]    # int
    1b7     addl    R9, [RSP + #48 (32-bit)]    # int
    1bc     movl    R10, R8 # spill
    1bf     addl    R10, #9 # int
    1c3     imull   R10, RSI    # int
    1c7     addl    R10, R9 # int
    1ca     addl    R10, RDI    # int
    1cd     addl    R10, [RSP + #44 (32-bit)]   # int
    1d2     movl    R11, R8 # spill
    1d5     addl    R11, #12    # int
    1d9     imull   R13, R11    # int
    1dd     addl    R13, R10    # int
    1e0     movl    R10, R8 # spill
    1e3     addl    R10, #13    # int
    1e7     imull   R10, [RSP + #40 (32-bit)]   # int
    1ed     addl    R10, R13    # int
    1f0     addl    RBP, R10    # int
    1f3     addl    R14, RBP    # int
    1f6     movl    R10, R8 # spill
    1f9     addl    R10, #16    # int
    1fd     cmpl    R10, #999999985
    204     jl     B2   # loop end  P=1.000000 C=7419903.000000
    

    Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved.
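
    The per-term shape is where those extra live values come from. Again a source-level sketch (helper names made up), not the JIT's actual output:

    // Sketch only: per unrolled term, 2 * (i * i) needs one temporary (the square,
    // which is then shifted), while 2 * i * i groups as (2 * i) * i, so the JIT keeps
    // both 2*(i + k) and (i + k) alive for each term, roughly doubling the register
    // pressure once 16 terms are in flight; hence the spills to [RSP + ...].
    static int fastTerm(int i, int k) {
        int t = i + k;
        return (t * t) << 1;             // one live temporary; the *2 is a shift
    }

    static int slowTerm(int i, int k) {
        return (2 * (i + k)) * (i + k);  // two live temporaries; the *2 is folded into one factor
    }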

    Thus the answer to the question is simple: 2 * (i * i) is faster than 2 * i * i because the JIT generates more efficient assembly code for the first case.


    But of course it is obvious that neither the first nor the second version is any good; the loop could really benefit from vectorization, since any x86-64 CPU has at least SSE2 support.

    So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.

    In fact, modern x86-64 CPUs break instructions down further into micro-ops (µops), and with features like register renaming, µop caches and loop buffers, loop optimization takes a lot more finesse than simple unrolling for optimal performance. According to Agner Fog's optimization guide:

    The gain in performance due to the µop cache can be quite considerable if the average instruction length is more than 4 bytes. The following methods of optimizing the use of the µop cache may be considered:

    • Make sure that critical loops are small enough to fit into the µop cache.
    • Align the most critical loop entries and function entries by 32.
    • Avoid unnecessary loop unrolling.
    • Avoid instructions that have extra load time
      . . .

    Regarding those load times - even the fastest L1D hit costs 4 cycles, an extra register and µop, so yes, even a few accesses to memory will hurt performance in tight loops.

    But back to the vectorization opportunity - to see how fast it can be, we can compile a similar C application with GCC, which outright vectorizes it (AVX2 is shown, SSE2 is similar)2:

      vmovdqa ymm0, YMMWORD PTR .LC0[rip]
      vmovdqa ymm3, YMMWORD PTR .LC1[rip]
      xor eax, eax
      vpxor xmm2, xmm2, xmm2
    .L2:
      vpmulld ymm1, ymm0, ymm0
      inc eax
      vpaddd ymm0, ymm0, ymm3
      vpslld ymm1, ymm1, 1
      vpaddd ymm2, ymm2, ymm1
      cmp eax, 125000000      ; 8 calculations per iteration
      jne .L2
      vmovdqa xmm0, xmm2
      vextracti128 xmm2, ymm2, 1
      vpaddd xmm2, xmm0, xmm2
      vpsrldq xmm0, xmm2, 8
      vpaddd xmm0, xmm2, xmm0
      vpsrldq xmm1, xmm0, 4
      vpaddd xmm0, xmm0, xmm1
      vmovd eax, xmm0
      vzeroupper
    

    With run times:

    • SSE: 0.24 s, or 2 times as fast.
    • AVX: 0.15 s, or 3 times as fast.
    • AVX2: 0.08 s, or 5 times as fast.

    1 To get the JIT-generated assembly output, get a debug JVM and run with -XX:+PrintOptoAssembly.

    2 The C version is compiled with the -fwrapv flag, which tells GCC to treat signed integer overflow as two's-complement wrap-around, matching Java's int semantics (n overflows long before the loop finishes).
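
    A tiny Java illustration of the wrap-around behaviour being matched (sketch only):

    // Java int arithmetic silently wraps modulo 2^32 on overflow; the C version
    // only matches that when GCC is told to assume wrap-around via -fwrapv.
    int big = Integer.MAX_VALUE;
    System.out.println(big + 1);   // prints -2147483648: wrapped around, no exception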

  • 2020-12-22 14:57

    While not directly related to the question's environment, just out of curiosity, I ran the same test on .NET Core 2.1, x64, release mode.

    Here is the interesting result, confirming that a similar phenomenon (the other way around) happens on the dark side of the force. Code:

    static void Main(string[] args)
    {
        Stopwatch watch = new Stopwatch();
    
        Console.WriteLine("2 * (i * i)");
    
        for (int a = 0; a < 10; a++)
        {
            int n = 0;
    
            watch.Restart();
    
            for (int i = 0; i < 1000000000; i++)
            {
                n += 2 * (i * i);
            }
    
            watch.Stop();
    
            Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds} ms");
        }
    
        Console.WriteLine();
        Console.WriteLine("2 * i * i");
    
        for (int a = 0; a < 10; a++)
        {
            int n = 0;
    
            watch.Restart();
    
            for (int i = 0; i < 1000000000; i++)
            {
                n += 2 * i * i;
            }
    
            watch.Stop();
    
            Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");
        }
    }
    

    Result:

    2 * (i * i)

    • result:119860736, 438 ms
    • result:119860736, 433 ms
    • result:119860736, 437 ms
    • result:119860736, 435 ms
    • result:119860736, 436 ms
    • result:119860736, 435 ms
    • result:119860736, 435 ms
    • result:119860736, 439 ms
    • result:119860736, 436 ms
    • result:119860736, 437 ms

    2 * i * i

    • result:119860736, 417 ms
    • result:119860736, 417 ms
    • result:119860736, 417 ms
    • result:119860736, 418 ms
    • result:119860736, 418 ms
    • result:119860736, 417 ms
    • result:119860736, 418 ms
    • result:119860736, 416 ms
    • result:119860736, 417 ms
    • result:119860736, 418 ms
  • 2020-12-22 15:01

    Kasperd asked in a comment of the accepted answer:

    The Java and C examples use quite different register names. Are both examples using the AMD64 ISA?

    xor edx, edx                  ; i = 0
    xor eax, eax                  ; n = 0
    .L2:
    mov ecx, edx                  ; ecx = i
    imul ecx, edx                 ; ecx = i * i
    add edx, 1                    ; i++
    lea eax, [rax+rcx*2]          ; n += 2 * (i * i), the *2 folded into the addressing mode
    cmp edx, 1000000000
    jne .L2
    

    I don't have enough reputation to answer this in the comments, but these are the same ISA. It's worth pointing out that the GCC version uses 32-bit integer logic and the JVM compiled version uses 64-bit integer logic internally.

    R8 to R15 are just the newer x86_64 registers; EAX to EDX are the lower halves of the RAX to RDX general-purpose registers. The important part of the answer is that the GCC version is not unrolled: it executes one round of the loop per machine-code loop iteration, while the JVM version has 16 rounds of the source loop in one physical loop (going by rustyx's answer; I did not reinterpret the assembly). That is one of the reasons more registers are used, since the loop body is actually 16 times longer.

  • 2020-12-22 15:01

    More of an addendum. I reproduced the experiment using the latest Java 8 JVM from IBM:

    java version "1.8.0_191"
    Java(TM) 2 Runtime Environment, Standard Edition (IBM build 1.8.0_191-b12 26_Oct_2018_18_45 Mac OS X x64(SR5 FP25))
    Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
    

    And this shows very similar results:

    0.374653912 s
    n = 119860736
    0.447778698 s
    n = 119860736
    

    (the second result is for 2 * i * i).

    Interestingly enough, when running on the same machine, but using Oracle Java:

    Java version "1.8.0_181"
    Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
    

    results are on average a bit slower:

    0.414331815 s
    n = 119860736
    0.491430656 s
    n = 119860736
    

    Long story short: even the minor version number of HotSpot matters here, as subtle differences within the JIT implementation can have notable effects.
