I was trying to reproduce some of the processor cache effects described in here. I understand that Java is a managed environment, and these examples will not translate exactly,
This is a suboptimal recompilation of a method.
The JIT compiler relies on run-time statistics gathered during interpretation. When the main method is compiled for the first time, the outer loop has not yet completed its first iteration, so the statistics say that the code after the inner loop is never executed. The JIT therefore does not bother compiling that code and generates an uncommon trap in its place.
When the inner loop ends for the first time, the uncommon trap is hit, causing the method to be deoptimized.
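For context, a benchmark of the shape this explanation applies to might look roughly like the sketch below. It is a hypothetical reconstruction, not the original test: the class name OsrLoopDemo, the method timeLoops, the array size and the default iteration count are all assumptions. The point it illustrates is that both loops live in one method that is entered only once, so the JIT has to compile it while the first pass of the inner loop is still running.

```java
class OsrLoopDemo {
    // Runs `rounds` timed passes of an inner loop that increments a[0]
    // `iterations` times per pass. Returns a[0] so the work stays observable.
    static int timeLoops(int iterations, int rounds) {
        int[] a = new int[16]; // array size is an assumption
        for (int k = 0; k < rounds; k++) {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                a[0]++;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Time for loop# " + k + ": " + elapsedMs + " ms");
        }
        return a[0];
    }

    public static void main(String[] args) {
        // 1 << 30 is an assumed iteration count; pass a smaller number
        // as the first argument for a quick run.
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 1 << 30;
        System.out.println("a[0] = " + timeLoops(iterations, 10));
    }
}
```

Running such a program with -XX:+PrintCompilation prints the compilation events, including "made not entrant" entries when a compiled method is invalidated. Seeing the generated machine code additionally requires -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly and the hsdis disassembler plugin.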
On the second iteration of the outer loop, the main method is recompiled with the new knowledge. The JIT now has more statistics and more context to work with. For some reason it no longer caches the value of a[0] in a register (probably because it is fooled by the wider context), so it generates an addl instruction that updates the array in memory, which is effectively a memory load combined with a store.
By contrast, during the first compilation the JIT caches the value of a[0] in a register, so there is only a mov instruction to store the value to memory (without a load).
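The difference between the two compiled forms can be pictured at the source level with the sketch below. This is only an analogy under stated assumptions: the class and method names are hypothetical, and the JIT is free to turn either method into either machine-code shape, so writing the loop one way does not guarantee the corresponding instructions.

```java
class ArrayIncrementShapes {
    // Analogous to the addl form: each iteration reads a[0], adds one,
    // and writes it back.
    static int incrementInMemory(int[] a, int iterations) {
        for (int i = 0; i < iterations; i++) {
            a[0]++;
        }
        return a[0];
    }

    // Analogous to the register-cached form: the running value lives in a
    // local variable and each iteration only stores it to the array.
    static int incrementViaLocal(int[] a, int iterations) {
        int value = a[0];
        for (int i = 0; i < iterations; i++) {
            value++;
            a[0] = value;
        }
        return a[0];
    }

    public static void main(String[] args) {
        int[] first = new int[1];
        int[] second = new int[1];
        // Both variants compute the same result; only the access pattern differs.
        System.out.println(incrementInMemory(first, 1_000_000));
        System.out.println(incrementViaLocal(second, 1_000_000));
    }
}
```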
Fast loop (first iteration):
0x00000000029fc562: mov %ecx,0x10(%r14) <<< array store
0x00000000029fc566: mov %r11d,%edi
0x00000000029fc569: mov %r9d,%ecx
0x00000000029fc56c: add %edi,%ecx
0x00000000029fc56e: mov %ecx,%r11d
0x00000000029fc571: add $0x10,%r11d <<< increment in register
0x00000000029fc575: mov %r11d,0x10(%r14) <<< array store
0x00000000029fc579: add $0x11,%ecx
0x00000000029fc57c: mov %edi,%r11d
0x00000000029fc57f: add $0x10,%r11d
0x00000000029fc583: cmp $0x3ffffff2,%r11d
0x00000000029fc58a: jl 0x00000000029fc562
Slow loop (after recompilation):
0x00000000029fa1b0: addl $0x10,0x10(%r14) <<< increment in memory
0x00000000029fa1b5: add $0x10,%r13d
0x00000000029fa1b9: cmp $0x3ffffff1,%r13d
0x00000000029fa1c0: jl 0x00000000029fa1b0
However, this problem seems to be fixed in JDK 9. I checked this test against a recent JDK 9 Early Access release and verified that it works as expected:
Time for loop# 0: 104 ms
Time for loop# 1: 101 ms
Time for loop# 2: 91 ms
Time for loop# 3: 63 ms
Time for loop# 4: 60 ms
Time for loop# 5: 60 ms
Time for loop# 6: 59 ms
Time for loop# 7: 55 ms
Time for loop# 8: 57 ms
Time for loop# 9: 59 ms