The following Java program takes on average between 0.50 secs and 0.55 secs to run:
public static void main(String[] args) {
long startTime = System.nano
I tried a JMH using the default archetype: I also added an optimized version based on Runemoro's explanation.
@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
//@BenchmarkMode({ Mode.All })
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {
@Param({ "100", "1000", "1000000000" })
private int size;
@Benchmark
public int two_square_i() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * (i * i);
}
return n;
}
@Benchmark
public int square_i_two() {
int n = 0;
for (int i = 0; i < size; i++) {
n += i * i;
}
return 2*n;
}
@Benchmark
public int two_i_() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * i * i;
}
return n;
}
}
The result are here:
Benchmark (size) Mode Samples Score Score error Units
o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op
On my PC (Core i7 860 - it is doing nothing much apart from reading on my smartphone):
n += i*i
then n*2
is first2 * (i * i)
is second.The JVM is clearly not optimizing the same way than a human does (based on Runemoro's answer).
Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class
I am not expert on bytecode, but we iload_2
before we imul
: that's probably where you get the difference: I can suppose that the JVM optimize reading i
twice (i
is already here, and there is no need to load it again) whilst in the 2*i*i
it can't.
The two methods of adding do generate slightly different byte code:
17: iconst_2
18: iload 4
20: iload 4
22: imul
23: imul
24: iadd
For 2 * (i * i)
vs:
17: iconst_2
18: iload 4
20: imul
21: iload 4
23: imul
24: iadd
For 2 * i * i
.
And when using a JMH benchmark like this:
@Warmup(iterations = 5, batchSize = 1)
@Measurement(iterations = 5, batchSize = 1)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MyBenchmark {
@Benchmark
public int noBrackets() {
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * i * i;
}
return n;
}
@Benchmark
public int brackets() {
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
return n;
}
}
The difference is clear:
# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: <none>
Benchmark (n) Mode Cnt Score Error Units
MyBenchmark.brackets 1000000000 avgt 5 380.889 ± 58.011 ms/op
MyBenchmark.noBrackets 1000000000 avgt 5 512.464 ± 11.098 ms/op
What you observe is correct, and not just an anomaly of your benchmarking style (i.e. no warmup, see How do I write a correct micro-benchmark in Java?)
Running again with Graal:
# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler
Benchmark (n) Mode Cnt Score Error Units
MyBenchmark.brackets 1000000000 avgt 5 335.100 ± 23.085 ms/op
MyBenchmark.noBrackets 1000000000 avgt 5 331.163 ± 50.670 ms/op
You see that the results are much closer, which makes sense, since Graal is an overall better performing, more modern, compiler.
So this is really just up to how well the JIT compiler is able to optimize a particular piece of code, and doesn't necessarily have a logical reason to it.
Byte codes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html Byte codes Viewer: https://github.com/Konloch/bytecode-viewer
On my JDK (Windows 10 64 bit, 1.8.0_65-b17) I can reproduce and explain:
public static void main(String[] args) {
int repeat = 10;
long A = 0;
long B = 0;
for (int i = 0; i < repeat; i++) {
A += test();
B += testB();
}
System.out.println(A / repeat + " ms");
System.out.println(B / repeat + " ms");
}
private static long test() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multi(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multi(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms A " + n);
return ms;
}
private static long testB() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multiB(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multiB(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms B " + n);
return ms;
}
private static int multiB(int i) {
return 2 * (i * i);
}
private static int multi(int i) {
return 2 * i * i;
}
Output:
...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms
So why? The byte code is this:
private static multiB(int arg0) { // 2 * (i * i)
<localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
iload0
imul
imul
ireturn
}
L2 {
}
}
private static multi(int arg0) { // 2 * i * i
<localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
imul
iload0
imul
ireturn
}
L2 {
}
}
The difference being:
With brackets (2 * (i * i)
):
Without brackets (2 * i * i
):
Loading all on the stack and then working back down is faster than switching between putting on the stack and operating on it.
Interesting observation using Java 11 and switching off loop unrolling with the following VM option:
-XX:LoopUnrollLimit=0
The loop with the 2 * (i * i)
expression results in more compact native code1:
L0001: add eax,r11d
inc r8d
mov r11d,r8d
imul r11d,r8d
shl r11d,1h
cmp r8d,r10d
jl L0001
in comparison with the 2 * i * i
version:
L0001: add eax,r11d
mov r11d,r8d
shl r11d,1h
add r11d,2h
inc r8d
imul r11d,r8d
cmp r8d,r10d
jl L0001
Java version:
java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)
Benchmark results:
Benchmark (size) Mode Cnt Score Error Units
LoopTest.fast 1000000000 avgt 5 694,868 ± 36,470 ms/op
LoopTest.slow 1000000000 avgt 5 769,840 ± 135,006 ms/op
Benchmark source code:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
public class LoopTest {
@Param("1000000000") private int size;
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(LoopTest.class.getSimpleName())
.jvmArgs("-XX:LoopUnrollLimit=0")
.build();
new Runner(opt).run();
}
@Benchmark
public int slow() {
int n = 0;
for (int i = 0; i < size; i++)
n += 2 * i * i;
return n;
}
@Benchmark
public int fast() {
int n = 0;
for (int i = 0; i < size; i++)
n += 2 * (i * i);
return n;
}
}
1 - VM options used: -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:LoopUnrollLimit=0