Java 8 stream unpredictable performance drop with no obvious reason

囚心锁ツ 2020-12-22 23:29

I am using Java 8 streams to iterate over a list of sublists. The outer list size varies between 100 and 1000 (across different test runs), and the inner list size is always 5.

1 Answer
  • 2020-12-22 23:41

    This effect is caused by Type Profile Pollution. Let me explain using a simplified benchmark:

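    package bench;
    
    import java.util.stream.Stream;
    
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;
    
    @State(Scope.Benchmark)
    public class Streams {
        // Only the amount of work done in setup differs between the two runs.
        @Param({"500", "520"})
        int iterations;
    
        // Runs before measurement; never executed inside the measured loop.
        @Setup
        public void init() {
            for (int i = 0; i < iterations; i++) {
                Stream.empty().reduce((x, y) -> x);
            }
        }
    
        // The measured operation is identical in both runs.
        @Benchmark
        public long loop() {
            return Stream.empty().count();
        }
    }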
    

    Although the iterations parameter changes only slightly and does not affect the main benchmark loop, the results show a very surprising 2.5x performance degradation:

    Benchmark     (iterations)   Mode  Cnt      Score     Error   Units
    Streams.loop           500  thrpt    5  29491,039 ± 240,953  ops/ms
    Streams.loop           520  thrpt    5  11867,860 ± 344,779  ops/ms
    

    Now let's run JMH with the -prof perfasm option to see the hottest code regions:

    Fast case (iterations = 500):

    ....[Hottest Methods (after inlining)]..................................
     48,66%  bench.generated.Streams_loop::loop_thrpt_jmhStub
     23,14%  <unknown>
      2,99%  java.util.stream.Sink$ChainedReference::<init>
      1,98%  org.openjdk.jmh.infra.Blackhole::consume
      1,68%  java.util.Objects::requireNonNull
      0,65%  java.util.stream.AbstractPipeline::evaluate
    

    Slow case (iterations = 520):

    ....[Hottest Methods (after inlining)]..................................
     40,09%  java.util.stream.ReduceOps$ReduceOp::evaluateSequential
     22,02%  <unknown>
     17,61%  bench.generated.Streams_loop::loop_thrpt_jmhStub
      1,25%  org.openjdk.jmh.infra.Blackhole::consume
      0,74%  java.util.stream.AbstractPipeline::evaluate
    

    It looks like the slow case spends most of its time in the ReduceOp.evaluateSequential method, which is not inlined. Furthermore, if we study the assembly code for this method, we find that the most expensive operation is checkcast.

    You know how the HotSpot compiler works: before the JIT kicks in, a method is executed in the interpreter for some time to collect profile data, e.g. which methods are called, which classes are seen, which branches are taken, etc. With tiered compilation, the profile is also collected in C1-compiled code. The profile is then used to generate C2-optimized code. However, if the application changes its execution pattern midway, the generated code may not be optimal for the modified behavior.
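
    To make the mechanism concrete, here is a minimal sketch in plain Java, outside the stream library. All names in it (ProfilePollutionSketch, lengthOf, warmUp, measured) are hypothetical and not part of the original benchmark. It assumes that a shared helper plays the role of ReduceOp.evaluateSequential: one phase feeds it one receiver type, another phase feeds it a different one, and both phases write into the same profile for the helper's checkcast and call sites.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; names are not taken from the JDK or from JMH.
final class ProfilePollutionSketch {

    // Shared helper, analogous in role to ReduceOp.evaluateSequential.
    // HotSpot profiles the checkcast and the call sites of this method
    // per bytecode, regardless of which caller invoked the helper.
    static int lengthOf(Supplier<?> source) {
        CharSequence cs = (CharSequence) source.get(); // checkcast from Object
        return cs.length();                              // interface call
    }

    // Setup-like phase: the helper only ever sees StringBuilder here.
    static long warmUp(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += lengthOf(() -> new StringBuilder("ab"));
        }
        return total;
    }

    // Measured phase: the helper only ever sees String here.
    static long measured(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += lengthOf(() -> "abc");
        }
        return total;
    }

    public static void main(String[] args) {
        // First argument: how long the setup-like phase runs (assumed default: 500).
        int warmUpIterations = args.length > 0 ? Integer.parseInt(args[0]) : 500;
        warmUp(warmUpIterations);

        long start = System.nanoTime();
        long total = measured(5_000_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("total=" + total + ", elapsed ms=" + elapsedMs);
    }
}
```

    Whether a longer warm-up phase produces a measurable slowdown here depends on the JVM version, the compilation thresholds and the flags in use, so treat the timing as a rough indication only; a JMH benchmark is the reliable way to measure it.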

    Let's use -XX:+PrintMethodData (available in debug JVM) to compare the execution profiles:

    ----- Fast case -----
    java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
      interpreter_invocation_count:    13382 
      invocation_counter:              13382 
      backedge_counter:                    0 
      mdo size: 552 bytes
    
    0 aload_1
    1 fast_aload_0
    2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;> 
      0   bci: 2    VirtualCallData     count(0) entries(1)
                                        'java/util/stream/ReduceOps$8'(12870 1.00)
    5 aload_2
    6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;> 
      48  bci: 6    VirtualCallData     count(0) entries(1)
                                        'java/util/stream/ReferencePipeline$5'(12870 1.00)
    9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>
      96  bci: 9    ReceiverTypeData    count(0) entries(1)
                                        'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00)
    12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;> 
      144 bci: 12   VirtualCallData     count(0) entries(1)
                                        'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00)
    17 areturn
    
    ----- Slow case -----
    java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
      interpreter_invocation_count:    54751 
      invocation_counter:              54751 
      backedge_counter:                    0 
      mdo size: 552 bytes
    
    0 aload_1
    1 fast_aload_0
    2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;> 
      0   bci: 2    VirtualCallData     count(0) entries(2)
                                        'java/util/stream/ReduceOps$2'(16 0.00)
                                        'java/util/stream/ReduceOps$8'(54223 1.00)
    5 aload_2
    6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;> 
      48  bci: 6    VirtualCallData     count(0) entries(2)
                                        'java/util/stream/ReferencePipeline$Head'(16 0.00)
                                        'java/util/stream/ReferencePipeline$5'(54223 1.00)
    9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>
      96  bci: 9    ReceiverTypeData    count(0) entries(2)
                                        'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)
                                        'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00)
    12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;> 
      144 bci: 12   VirtualCallData     count(0) entries(2)
                                        'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)
                                        'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00)
    17 areturn
    

    You see, the initialization loop ran long enough that its statistics ended up in the execution profile: every virtual call site has two receiver types, and checkcast also has two different entries. In the fast case the profile is not polluted: all sites are monomorphic, and the JIT can easily inline and optimize them.

    The same is true for your original benchmark: the longer stream operations in the init() method polluted the profile. If you play with the profiling and tiered compilation options, the results can be quite different. For example, try

    1. -XX:-ProfileInterpreter
    2. -XX:Tier3InvocationThreshold=1000
    3. -XX:-TieredCompilation
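
    When experimenting with these options, it can help to confirm what the running JVM actually uses. The following is a small sketch, assuming a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean; the class name ActiveJitFlags and the "unavailable" placeholder are hypothetical. It reads the current values of the three flags listed above.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes a HotSpot-based JVM.
final class ActiveJitFlags {

    // The flags mentioned in the list above, without the -XX: prefix.
    static final String[] FLAGS = {
        "ProfileInterpreter", "Tier3InvocationThreshold", "TieredCompilation"
    };

    // Returns flag name -> current value, or "unavailable" if the JVM
    // does not expose the diagnostic bean or does not know the flag.
    static Map<String, String> read() {
        HotSpotDiagnosticMXBean bean = null;
        try {
            bean = ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        } catch (IllegalArgumentException e) {
            // Not a platform MXBean on this JVM; fall through with bean == null.
        }

        Map<String, String> values = new LinkedHashMap<>();
        for (String flag : FLAGS) {
            String value = "unavailable";
            if (bean != null) {
                try {
                    value = bean.getVMOption(flag).getValue();
                } catch (IllegalArgumentException e) {
                    // This JVM build does not have the flag.
                }
            }
            values.put(flag, value);
        }
        return values;
    }

    public static void main(String[] args) {
        read().forEach((flag, value) -> System.out.println(flag + " = " + value));
    }
}
```

    Note that JMH runs benchmarks in forked JVMs, so a check like this only reflects the process it is executed in.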

    Finally, this problem is not unique. There are already multiple JVM bugs related to performance regressions due to profile pollution: JDK-8015416, JDK-8015417, JDK-8059879... Hope this will be improved in Java 9.
