I am using Java 8 streams to iterate over a list with sublists. The outer list size varies between 100 and 1000 (across different test runs) and the inner list size is always 5.
This effect is caused by type profile pollution. Let me explain with a simplified benchmark:
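package bench;

import java.util.stream.Stream;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class Streams {
    @Param({"500", "520"})
    int iterations;

    // Runs a different terminal operation (reduce) before the measured code.
    @Setup
    public void init() {
        for (int i = 0; i < iterations; i++) {
            Stream.empty().reduce((x, y) -> x);
        }
    }

    // The measured code; it never looks at the iterations parameter.
    @Benchmark
    public long loop() {
        return Stream.empty().count();
    }
}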
Although the iterations parameter changes only slightly and does not affect the main benchmark loop, the results show a surprising 2.5x performance degradation:
Benchmark     (iterations)   Mode  Cnt      Score      Error   Units
Streams.loop           500  thrpt    5  29491,039 ±  240,953  ops/ms
Streams.loop           520  thrpt    5  11867,860 ±  344,779  ops/ms
Now let's run JMH with the -prof perfasm option to see the hottest code regions:
Fast case (iterations = 500):
....[Hottest Methods (after inlining)]..................................
48,66% bench.generated.Streams_loop::loop_thrpt_jmhStub
23,14% <unknown>
2,99% java.util.stream.Sink$ChainedReference::<init>
1,98% org.openjdk.jmh.infra.Blackhole::consume
1,68% java.util.Objects::requireNonNull
0,65% java.util.stream.AbstractPipeline::evaluate
Slow case (iterations = 520):
....[Hottest Methods (after inlining)]..................................
40,09% java.util.stream.ReduceOps$ReduceOp::evaluateSequential
22,02% <unknown>
17,61% bench.generated.Streams_loop::loop_thrpt_jmhStub
1,25% org.openjdk.jmh.infra.Blackhole::consume
0,74% java.util.stream.AbstractPipeline::evaluate
It looks like the slow case spends most of its time in the ReduceOp.evaluateSequential method, which is not inlined. Furthermore, if we study the assembly code for this method, we'll find that the most expensive operation is checkcast.
As you may know, the HotSpot compiler works as follows: before the JIT kicks in, a method is executed in the interpreter for some time to collect profile data, e.g. which methods are called, which classes are seen, which branches are taken, etc. With tiered compilation the profile is also collected in C1-compiled code. The profile is then used to generate C2-optimized code. However, if the application changes its execution pattern midway, the generated code may not be optimal for the new behavior.
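To make the mechanism more concrete, here is a minimal sketch in plain Java of how a setup phase and a measured phase can end up feeding the same call-site profile. All names (ProfilePollutionSketch, AddOne, AddTwo, apply, setup, hotLoop) are hypothetical and are not part of the benchmark above. Whether any timing difference actually shows up depends on the JVM version, the flags, and how the JIT compiles this loop, so treat it as an illustration of the structure rather than as a replacement for the JMH benchmark.

```java
import java.util.function.IntUnaryOperator;

public class ProfilePollutionSketch {
    // Two hypothetical receiver classes; any two implementations of one interface would do.
    static final class AddOne implements IntUnaryOperator {
        @Override public int applyAsInt(int x) { return x + 1; }
    }

    static final class AddTwo implements IntUnaryOperator {
        @Override public int applyAsInt(int x) { return x + 2; }
    }

    // The interface call below is a single profiled call site. Its profile belongs
    // to this method's bytecode, so every caller of apply() contributes to it.
    static int apply(IntUnaryOperator op, int x) {
        return op.applyAsInt(x);
    }

    // Plays the role of init(): sends a second receiver class through apply().
    static int setup(int iterations) {
        IntUnaryOperator op = new AddTwo();
        int acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc = apply(op, acc);
        }
        return acc;
    }

    // Plays the role of the measured loop: the caller decides which class reaches apply().
    static int hotLoop(IntUnaryOperator op, int rounds) {
        int acc = 0;
        for (int i = 0; i < rounds; i++) {
            acc = apply(op, acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        // Number of setup iterations, e.g. 0 for a clean profile or a few hundred to pollute it.
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 0;
        setup(iterations);

        long start = System.nanoTime();
        int result = hotLoop(new AddOne(), 100_000_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("result=" + result + ", elapsed=" + elapsedMs + " ms");
    }
}
```

The point of the sketch is that apply() has one profile no matter who calls it: if setup() runs long enough for profiling to be active, the AddTwo receivers are recorded at the same site that the hot loop later depends on.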
Let's use -XX:+PrintMethodData (available in debug builds of the JVM) to compare the execution profiles:
----- Fast case -----
java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
interpreter_invocation_count: 13382
invocation_counter: 13382
backedge_counter: 0
mdo size: 552 bytes
0 aload_1
1 fast_aload_0
2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;>
0 bci: 2 VirtualCallData count(0) entries(1)
'java/util/stream/ReduceOps$8'(12870 1.00)
5 aload_2
6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;>
48 bci: 6 VirtualCallData count(0) entries(1)
'java/util/stream/ReferencePipeline$5'(12870 1.00)
9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>
96 bci: 9 ReceiverTypeData count(0) entries(1)
'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00)
12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;>
144 bci: 12 VirtualCallData count(0) entries(1)
'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00)
17 areturn
----- Slow case -----
java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
interpreter_invocation_count: 54751
invocation_counter: 54751
backedge_counter: 0
mdo size: 552 bytes
0 aload_1
1 fast_aload_0
2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;>
0 bci: 2 VirtualCallData count(0) entries(2)
'java/util/stream/ReduceOps$2'(16 0.00)
'java/util/stream/ReduceOps$8'(54223 1.00)
5 aload_2
6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;>
48 bci: 6 VirtualCallData count(0) entries(2)
'java/util/stream/ReferencePipeline$Head'(16 0.00)
'java/util/stream/ReferencePipeline$5'(54223 1.00)
9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>
96 bci: 9 ReceiverTypeData count(0) entries(2)
'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)
'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00)
12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;>
144 bci: 12 VirtualCallData count(0) entries(2)
'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)
'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00)
17 areturn
As you can see, the initialization loop ran long enough for its statistics to appear in the execution profile: every virtual call site has seen two receiver types, and checkcast also has two different entries. In the fast case the profile is not polluted: all sites are monomorphic, and the JIT can easily inline and optimize them.
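The difference between the two profiles can be mimicked with a toy model. The sketch below is not HotSpot's actual data structure; ReceiverProfileSketch and its methods are hypothetical names, and only the class names and counts are copied from the checkcast entries in the dumps above. It shows that a site stops being monomorphic as soon as a second receiver class is recorded, even when that class accounts for a tiny share of the hits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReceiverProfileSketch {
    // Receiver class name -> number of times it was seen at one profiled site.
    private final Map<String, Long> counts = new LinkedHashMap<>();

    void record(String receiverClass, long times) {
        counts.merge(receiverClass, times, Long::sum);
    }

    int entries() {
        return counts.size();
    }

    // A site is monomorphic when exactly one receiver class has been recorded.
    boolean isMonomorphic() {
        return counts.size() == 1;
    }

    // Fraction of all recorded hits that belong to the given class.
    double share(String receiverClass) {
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        return total == 0 ? 0.0 : (double) counts.getOrDefault(receiverClass, 0L) / total;
    }

    // Counts taken from the checkcast entry (bci 9) of the fast-case dump.
    static ReceiverProfileSketch fastCase() {
        ReceiverProfileSketch p = new ReceiverProfileSketch();
        p.record("java/util/stream/ReduceOps$8ReducingSink", 12870);
        return p;
    }

    // Counts taken from the checkcast entry (bci 9) of the slow-case dump.
    static ReceiverProfileSketch slowCase() {
        ReceiverProfileSketch p = new ReceiverProfileSketch();
        p.record("java/util/stream/ReduceOps$2ReducingSink", 16);
        p.record("java/util/stream/ReduceOps$8ReducingSink", 54228);
        return p;
    }

    public static void main(String[] args) {
        System.out.println("fast: entries=" + fastCase().entries()
                + ", monomorphic=" + fastCase().isMonomorphic());
        System.out.println("slow: entries=" + slowCase().entries()
                + ", monomorphic=" + slowCase().isMonomorphic());
    }
}
```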
The same is true for your original benchmark: the longer stream operations in the init() method polluted the profile. If you play with the profiling and tiered compilation options, the results can be quite different. For example, try:
-XX:-ProfileInterpreter
-XX:Tier3InvocationThreshold=1000
-XX:-TieredCompilation
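When experimenting with these options, it can help to confirm that they actually reached the JVM running the code. The following is a small sketch, assuming a HotSpot-based JVM where the com.sun.management.HotSpotDiagnosticMXBean platform bean is available; the class name ShowProfileFlags and the helper flag() are made up for this example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class ShowProfileFlags {
    // Returns the current value of a VM option, or a placeholder if it cannot be read
    // (for example on a non-HotSpot JVM or when the option does not exist).
    static String flag(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean == null ? "<unavailable>" : bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            return "<unavailable>";
        }
    }

    public static void main(String[] args) {
        String[] names = {"ProfileInterpreter", "Tier3InvocationThreshold", "TieredCompilation"};
        for (String name : names) {
            System.out.println(name + " = " + flag(name));
        }
    }
}
```

Running it once with and once without, say, -XX:-TieredCompilation on the command line should show the TieredCompilation value change accordingly.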
Finally, this problem is not unique. There are already multiple JVM bugs related to performance regressions due to profile pollution: JDK-8015416, JDK-8015417, JDK-8059879... Hopefully this will be improved in Java 9.