Why is 2 * (i * i) faster than 2 * i * i in Java?

一生所求 2020-12-22 14:43

The following Java program takes on average between 0.50 secs and 0.55 secs to run:

public static void main(String[] args) {
    long startTime = System.nano         


        
10 Answers
  • 2020-12-22 15:02

    I ran a JMH benchmark generated from the default archetype, and added an optimized version based on Runemoro's explanation.

    package org.sample;

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Benchmark)
    @Warmup(iterations = 2)
    @Fork(1)
    @Measurement(iterations = 10)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    //@BenchmarkMode({ Mode.All })
    @BenchmarkMode(Mode.AverageTime)
    public class MyBenchmark {
      @Param({ "100", "1000", "1000000000" })
      private int size;
    
      @Benchmark
      public int two_square_i() {
        int n = 0;
        for (int i = 0; i < size; i++) {
          n += 2 * (i * i);
        }
        return n;
      }
    
      @Benchmark
      public int square_i_two() {
        int n = 0;
        for (int i = 0; i < size; i++) {
          n += i * i;
        }
        return 2*n;
      }
    
      @Benchmark
      public int two_i_() {
        int n = 0;
        for (int i = 0; i < size; i++) {
          n += 2 * i * i;
        }
        return n;
      }
    }
    

    The results are here:

    Benchmark                           (size)  Mode  Samples          Score   Score error  Units
    o.s.MyBenchmark.square_i_two           100  avgt       10         58,062         1,410  ns/op
    o.s.MyBenchmark.square_i_two          1000  avgt       10        547,393        12,851  ns/op
    o.s.MyBenchmark.square_i_two    1000000000  avgt       10  540343681,267  16795210,324  ns/op
    o.s.MyBenchmark.two_i_                 100  avgt       10         87,491         2,004  ns/op
    o.s.MyBenchmark.two_i_                1000  avgt       10       1015,388        30,313  ns/op
    o.s.MyBenchmark.two_i_          1000000000  avgt       10  967100076,600  24929570,556  ns/op
    o.s.MyBenchmark.two_square_i           100  avgt       10         70,715         2,107  ns/op
    o.s.MyBenchmark.two_square_i          1000  avgt       10        686,977        24,613  ns/op
    o.s.MyBenchmark.two_square_i    1000000000  avgt       10  652736811,450  27015580,488  ns/op
    

    On my PC (a Core i7 860, otherwise mostly idle while I read on my smartphone):

    • n += i * i followed by 2 * n is fastest.
    • 2 * (i * i) is second.
    • 2 * i * i is slowest.

    The JVM is clearly not optimizing the same way a human would (based on Runemoro's answer).
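
    The three variants are interchangeable because Java int arithmetic wraps around on overflow, so regrouping the multiplications or moving the factor 2 outside the loop does not change the returned value, even when n overflows. A minimal sketch that checks this follows; the class and method names are made up for illustration and are not part of the benchmark above.

```java
// Illustrative check (hypothetical names): the three loop bodies return the
// same int, even when the accumulator overflows and wraps around.
class VariantEquivalence {

    // n += 2 * (i * i)
    static int twoTimesSquare(int size) {
        int n = 0;
        for (int i = 0; i < size; i++) {
            n += 2 * (i * i);
        }
        return n;
    }

    // n += 2 * i * i, which Java parses as (2 * i) * i
    static int twoTimesITimesI(int size) {
        int n = 0;
        for (int i = 0; i < size; i++) {
            n += 2 * i * i;
        }
        return n;
    }

    // n += i * i, with the doubling done once after the loop
    static int squareThenDouble(int size) {
        int n = 0;
        for (int i = 0; i < size; i++) {
            n += i * i;
        }
        return 2 * n;
    }

    public static void main(String[] args) {
        // 1_000_000 is large enough for the int accumulator to overflow.
        for (int size : new int[] {100, 1000, 1_000_000}) {
            int a = twoTimesSquare(size);
            int b = twoTimesITimesI(size);
            int c = squareThenDouble(size);
            System.out.println(size + ": " + a + " " + b + " " + c
                    + " equal=" + (a == b && b == c));
        }
    }
}
```

    Since the results are identical, any timing difference between the variants comes from the code the JIT compiler generates, not from the arithmetic itself.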

    Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class

    • Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI
    • Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP

    I am no expert on bytecode, but the two versions differ in where the second iload_2 sits relative to the first imul, and that is probably where the difference comes from: I suppose the JVM can optimize reading i twice in 2*(i*i) (i is already there, so there is no need to load it again), whilst in 2*i*i it can't.

  • 2020-12-22 15:02

    The two ways of writing the expression do generate slightly different bytecode:

      17: iconst_2
      18: iload         4
      20: iload         4
      22: imul
      23: imul
      24: iadd
    

    For 2 * (i * i) vs:

      17: iconst_2
      18: iload         4
      20: imul
      21: iload         4
      23: imul
      24: iadd
    

    For 2 * i * i.

    And when using a JMH benchmark like this:

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @Warmup(iterations = 5, batchSize = 1)
    @Measurement(iterations = 5, batchSize = 1)
    @Fork(1)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @State(Scope.Benchmark)
    public class MyBenchmark {
    
        @Benchmark
        public int noBrackets() {
            int n = 0;
            for (int i = 0; i < 1000000000; i++) {
                n += 2 * i * i;
            }
            return n;
        }
    
        @Benchmark
        public int brackets() {
            int n = 0;
            for (int i = 0; i < 1000000000; i++) {
                n += 2 * (i * i);
            }
            return n;
        }
    
    }
    

    The difference is clear:

    # JMH version: 1.21
    # VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
    # VM options: <none>
    
    Benchmark                      (n)  Mode  Cnt    Score    Error  Units
    MyBenchmark.brackets    1000000000  avgt    5  380.889 ± 58.011  ms/op
    MyBenchmark.noBrackets  1000000000  avgt    5  512.464 ± 11.098  ms/op
    

    What you observe is correct, and not just an anomaly of your benchmarking style (i.e. no warmup; see "How do I write a correct micro-benchmark in Java?").

    Running again with Graal:

    # JMH version: 1.21
    # VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
    # VM options: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler
    
    Benchmark                      (n)  Mode  Cnt    Score    Error  Units
    MyBenchmark.brackets    1000000000  avgt    5  335.100 ± 23.085  ms/op
    MyBenchmark.noBrackets  1000000000  avgt    5  331.163 ± 50.670  ms/op
    

    You can see that the results are much closer, which makes sense, since Graal is a more modern compiler that generally performs better.

    So this really comes down to how well the JIT compiler is able to optimize a particular piece of code, and there is not necessarily a logical reason behind it.

  • 2020-12-22 15:04

    Bytecode reference: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html
    Bytecode viewer: https://github.com/Konloch/bytecode-viewer

    On my JDK (Windows 10 64 bit, 1.8.0_65-b17) I can reproduce and explain:

    public static void main(String[] args) {
        int repeat = 10;
        long A = 0;
        long B = 0;
        for (int i = 0; i < repeat; i++) {
            A += test();
            B += testB();
        }
    
        System.out.println(A / repeat + " ms");
        System.out.println(B / repeat + " ms");
    }
    
    
    private static long test() {
        int n = 0;
        for (int i = 0; i < 1000; i++) {
            n += multi(i);
        }
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < 1000000000; i++) {
            n += multi(i);
        }
        long ms = (System.currentTimeMillis() - startTime);
        System.out.println(ms + " ms A " + n);
        return ms;
    }
    
    
    private static long testB() {
        int n = 0;
        for (int i = 0; i < 1000; i++) {
            n += multiB(i);
        }
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < 1000000000; i++) {
            n += multiB(i);
        }
        long ms = (System.currentTimeMillis() - startTime);
        System.out.println(ms + " ms B " + n);
        return ms;
    }
    
    private static int multiB(int i) {
        return 2 * (i * i);
    }
    
    private static int multi(int i) {
        return 2 * i * i;
    }
    

    Output:

    ...
    405 ms A 785527736
    327 ms B 785527736
    404 ms A 785527736
    329 ms B 785527736
    404 ms A 785527736
    328 ms B 785527736
    404 ms A 785527736
    328 ms B 785527736
    410 ms
    333 ms
    

    So why? The byte code is this:

     private static multiB(int arg0) { // 2 * (i * i)
         <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>
    
         L1 {
             iconst_2
             iload0
             iload0
             imul
             imul
             ireturn
         }
         L2 {
         }
     }
    
     private static multi(int arg0) { // 2 * i * i
         <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>
    
         L1 {
             iconst_2
             iload0
             imul
             iload0
             imul
             ireturn
         }
         L2 {
         }
     }
    

    The difference is as follows. With brackets (2 * (i * i)):

    • push const on stack
    • push local on stack
    • push local on stack
    • multiply top of stack
    • multiply top of stack

    Without brackets (2 * i * i):

    • push const on stack
    • push local on stack
    • multiply top of stack
    • push local on stack
    • multiply top of stack

    Loading all on the stack and then working back down is faster than switching between putting on the stack and operating on it.
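
    The two instruction orders can be replayed on a toy operand stack to make the difference concrete. The sketch below is a hypothetical mini-interpreter, not how HotSpot executes code: it only understands the three opcodes involved (spelled as javap prints them, e.g. iload_0) and reports the result together with the deepest stack it reached. It shows what differs between the two sequences, not why one is faster, since the timings above come from JIT-compiled native code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy operand-stack interpreter (illustrative only; names are hypothetical).
class TinyStackMachine {

    static final String[] WITH_BRACKETS =
            {"iconst_2", "iload_0", "iload_0", "imul", "imul"};
    static final String[] WITHOUT_BRACKETS =
            {"iconst_2", "iload_0", "imul", "iload_0", "imul"};

    // Runs the opcodes with local variable 0 set to local0.
    // Returns {result, maximum stack depth reached}.
    static int[] run(String[] ops, int local0) {
        Deque<Integer> stack = new ArrayDeque<>();
        int maxDepth = 0;
        for (String op : ops) {
            switch (op) {
                case "iconst_2":
                    stack.push(2);
                    break;
                case "iload_0":
                    stack.push(local0);
                    break;
                case "imul": {
                    int b = stack.pop();
                    int a = stack.pop();
                    stack.push(a * b);
                    break;
                }
                default:
                    throw new IllegalArgumentException("unknown opcode: " + op);
            }
            maxDepth = Math.max(maxDepth, stack.size());
        }
        return new int[] {stack.pop(), maxDepth};
    }

    public static void main(String[] args) {
        int[] a = run(WITH_BRACKETS, 7);
        int[] b = run(WITHOUT_BRACKETS, 7);
        System.out.println("2 * (i * i): result=" + a[0] + " maxDepth=" + a[1]);
        System.out.println("2 * i * i:   result=" + b[0] + " maxDepth=" + b[1]);
    }
}
```

    Both sequences produce the same value; the bracketed form keeps three entries on the stack at its peak, the unbracketed form only two.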

  • 2020-12-22 15:08

    Interesting observation using Java 11 and switching off loop unrolling with the following VM option:

    -XX:LoopUnrollLimit=0
    

    The loop with the 2 * (i * i) expression results in more compact native code [1]:

    L0001: add    eax,r11d
           inc    r8d
           mov    r11d,r8d
           imul   r11d,r8d
           shl    r11d,1h
           cmp    r8d,r10d
           jl     L0001
    

    in comparison with the 2 * i * i version:

    L0001: add    eax,r11d
           mov    r11d,r8d
           shl    r11d,1h
           add    r11d,2h
           inc    r8d
           imul   r11d,r8d
           cmp    r8d,r10d
           jl     L0001
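
    To read the two listings side by side, each loop can be transcribed into Java one instruction at a time, using the register names as variable names. This is only a rough model under stated assumptions: the listings show the loop body alone, so the initial register values (all zero) and the guard for an empty loop are guesses about the surrounding code, and the class and method names are hypothetical.

```java
// Rough instruction-by-instruction model of the two native loops
// (illustrative; loop setup is assumed, not taken from the listings).
class NativeLoopModel {

    // Models the loop generated for 2 * (i * i): 7 instructions per iteration.
    static int bracketsLoop(int r10d /* size */) {
        int eax = 0, r8d = 0, r11d = 0;   // assumed initial state
        if (r10d <= 0) return 0;          // assumed guard before the loop
        do {
            eax += r11d;                  // add  eax,r11d
            r8d++;                        // inc  r8d
            r11d = r8d;                   // mov  r11d,r8d
            r11d *= r8d;                  // imul r11d,r8d
            r11d <<= 1;                   // shl  r11d,1h
        } while (r8d < r10d);             // cmp  r8d,r10d / jl L0001
        return eax;
    }

    // Models the loop generated for 2 * i * i: 8 instructions per iteration.
    static int noBracketsLoop(int r10d /* size */) {
        int eax = 0, r8d = 0, r11d = 0;   // assumed initial state
        if (r10d <= 0) return 0;          // assumed guard before the loop
        do {
            eax += r11d;                  // add  eax,r11d
            r11d = r8d;                   // mov  r11d,r8d
            r11d <<= 1;                   // shl  r11d,1h
            r11d += 2;                    // add  r11d,2h   -> 2 * (i + 1)
            r8d++;                        // inc  r8d
            r11d *= r8d;                  // imul r11d,r8d  -> 2 * (i + 1) * (i + 1)
        } while (r8d < r10d);             // cmp  r8d,r10d / jl L0001
        return eax;
    }

    public static void main(String[] args) {
        System.out.println(bracketsLoop(100) + " " + noBracketsLoop(100));
    }
}
```

    In both models the term for the next value of i is computed at the end of the iteration and added at the start of the following one. The second loop needs one extra add because it doubles i before incrementing it.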
    

    Java version:

    java version "11" 2018-09-25
    Java(TM) SE Runtime Environment 18.9 (build 11+28)
    Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)
    

    Benchmark results:

    Benchmark          (size)  Mode  Cnt    Score     Error  Units
    LoopTest.fast  1000000000  avgt    5  694,868 ±  36,470  ms/op
    LoopTest.slow  1000000000  avgt    5  769,840 ± 135,006  ms/op
    

    Benchmark source code:

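    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    @BenchmarkMode(Mode.AverageTime)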
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
    @State(Scope.Thread)
    @Fork(1)
    public class LoopTest {
    
        @Param("1000000000") private int size;
    
        public static void main(String[] args) throws RunnerException {
            Options opt = new OptionsBuilder()
                .include(LoopTest.class.getSimpleName())
                .jvmArgs("-XX:LoopUnrollLimit=0")
                .build();
            new Runner(opt).run();
        }
    
        @Benchmark
        public int slow() {
            int n = 0;
            for (int i = 0; i < size; i++)
                n += 2 * i * i;
            return n;
        }
    
        @Benchmark
        public int fast() {
            int n = 0;
            for (int i = 0; i < size; i++)
                n += 2 * (i * i);
            return n;
        }
    }
    

    [1] VM options used: -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:LoopUnrollLimit=0
