Why does the JVM show more latency for the same block of code after a busy spin pause?

前端 未结 2 1672
无人共我
无人共我 2021-02-08 13:33

The code below demonstrates the problem unequivocally, which is:

The exact same block of code becomes slower after a busy spin pause.

相关标签:
2条回答
  • 2021-02-08 14:15

    You can probably not rely on the precision of any timer for the accuracy you seem to want, https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime-- states that

    This method provides nanosecond precision, but not necessarily nanosecond resolution (that is, how frequently the value changes) - no guarantees are made except that the resolution is at least as good as that of currentTimeMillis().

    0 讨论(0)
  • 2021-02-08 14:22

    TL;DR

    http://www.brendangregg.com/activebenchmarking.html

    casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.

    Problem N1. The very first measurement after the pause change.

    It looks like you are faced with on-stack replacement. When OSR occurs, the VM is paused, and the stack frame for the target function is replaced by an equivalent frame.

    The root case is wrong microbenchmark - it was not properly warmed up. Just insert the following line into your benchmark before while loop in order to fix it:

    System.out.println("WARMUP = " + busyPause(5000000000L));
    

    How to check this - just run your benchmark with -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls. I've modified your code - now it prints interval into system output before every call:

    interval = 1
    interval = 1
    interval = 5000000000
        689  145       4       JvmPauseLatency::busyPause (19 bytes)   made not entrant
        689  146       3       JvmPauseLatency::busyPause (19 bytes)
    Installing method (3) JvmPauseLatency.busyPause(J)J 
        698  147 %     4       JvmPauseLatency::busyPause @ 6 (19 bytes)
    Installing osr method (4) JvmPauseLatency.busyPause(J)J @ 6
        702  148       4       JvmPauseLatency::busyPause (19 bytes)
        705  146       3       JvmPauseLatency::busyPause (19 bytes)   made not entrant
    Installing method (4) JvmPauseLatency.busyPause(J)J 
    interval = 5000000000
    interval = 5000000000
    interval = 5000000000
    interval = 5000000000
    

    Usually OSR occurs at tier 4 so in order to disable it you can use the following options:

    • -XX:-TieredCompilation disable tiered compilation
    • -XX:-TieredCompilation -XX:TieredStopAtLevel=3 disable tiered compilation to 4 level
    • -XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement disable OSR

    Problem N2. How to measure.

    Let's start from the article https://shipilev.net/blog/2014/nanotrusting-nanotime. In few words:

    • JIT can compile only method - in your test you have one loop, so only OSR is available for your test
    • you are trying to measure something small, maybe smaller than nanoTime() call(see What is the cost of volatile write?)
    • microarchitecture level – caches, CPU pipeline stalls are important, for example, TLB miss or branch misprediction take more time than the test execution time

    So in order to avoid all these pitfalls you can use JMH based benchmark like this:

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;
    import org.openjdk.jmh.runner.options.VerboseMode;
    
    import java.util.Random;
    import java.util.concurrent.TimeUnit;
    
    @State(Scope.Benchmark)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 2, time = 3, timeUnit = TimeUnit.SECONDS)
    @Fork(value = 2)
    public class LatencyTest {
    
        public static final long LONG_PAUSE = 5000L;
        public static final long SHORT_PAUSE = 1L;
        public Random rand;
    
        @Setup
        public void initI() {
            rand = new Random(0xDEAD_BEEF);
        }
    
        private long busyPause(long pauseInNanos) {
            Blackhole.consumeCPU(pauseInNanos);
            return pauseInNanos;
        }
    
        @Benchmark
        @BenchmarkMode({Mode.AverageTime})
        public long latencyBusyPauseShort() {
            return busyPause(SHORT_PAUSE);
        }
    
        @Benchmark
        @BenchmarkMode({Mode.AverageTime})
        public long latencyBusyPauseLong() {
            return busyPause(LONG_PAUSE);
        }
    
        @Benchmark
        @BenchmarkMode({Mode.AverageTime})
        public long latencyFunc() {
            return doCalculation(1);
        }
    
        @Benchmark
        @BenchmarkMode({Mode.AverageTime})
        public long measureShort() {
            long x = busyPause(SHORT_PAUSE);
            return doCalculation(x);
        }
    
        @Benchmark
        @BenchmarkMode({Mode.AverageTime})
        public long measureLong() {
            long x = busyPause(LONG_PAUSE);
            return doCalculation(x);
        }
    
        private long doCalculation(long x) {
            long calculation = 0;
            calculation += x / (rand.nextInt(5) + 1);
            calculation -= calculation / (rand.nextInt(5) + 1);
            calculation -= x / (rand.nextInt(6) + 1);
            calculation += calculation / (rand.nextInt(6) + 1);
            return calculation;
        }
    
        public static void main(String[] args) throws RunnerException {
            Options options = new OptionsBuilder()
                    .include(LatencyTest.class.getName())
                    .verbosity(VerboseMode.NORMAL)
                    .build();
            new Runner(options).run();
        }
    }
    

    Please note that I've changed busy loop implementation to Blackhole#consumeCPU() in order to avoid OS related effects. So my results are:

    Benchmark                          Mode  Cnt      Score     Error  Units
    LatencyTest.latencyBusyPauseLong   avgt    4  15992.216 ± 106.538  ns/op
    LatencyTest.latencyBusyPauseShort  avgt    4      6.450 ±   0.163  ns/op
    LatencyTest.latencyFunc            avgt    4     97.321 ±   0.984  ns/op
    LatencyTest.measureLong            avgt    4  16103.228 ± 102.338  ns/op
    LatencyTest.measureShort           avgt    4    100.454 ±   0.041  ns/op
    

    Please note that the results are almost additive, i.e. latencyFunc + latencyBusyPauseShort = measureShort

    Problem N3. The discrepancy is big.

    What is wrong with your test? It does not warm-up JVM properly, i.e. it uses one parameter to warm-up and another to test. Why is this important? JVM uses profile-guided optimizations, for example, it counts how often a branch has been taken and generates "the best"(branch-free) code for the particular profile. So then we are trying to warm-up JVM our benchmark with parameter 1, JVM generates "optimal code" where branch in while loop has been never taken. Here is an event from JIT compilation log(-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation):

    <branch prob="0.0408393" not_taken="40960" taken="1744" cnt="42704" target_bci="42"/> 
    

    After property change JIT uses uncommon trap in order to process your code which is not optimal. I've created a benchmark which is based on original one with minor changes:

    • busyPause replaced by consumeCPU from JMH in order to have pure java benchmark without interactions with system(actually nano time uses userland function vdso clock_gettime and we unable to profile this code)
    • all calculations are removed

    _

    import java.util.Arrays;
    
    public class JvmPauseLatency {
    
        private static final int WARMUP = 2000 ;
        private static final int EXTRA = 10;
        private static final long PAUSE = 70000L; // in nanos
        private static volatile long consumedCPU = System.nanoTime();
    
        //org.openjdk.jmh.infra.Blackhole.consumeCPU()
        private static void consumeCPU(long tokens) {
            long t = consumedCPU;
            for (long i = tokens; i > 0; i--) {
                t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
            }
            if (t == 42) {
                consumedCPU += t;
            }
        }
    
        public void run(long warmPause) {
            long[] results = new long[WARMUP + EXTRA];
            int count = 0;
            long interval = warmPause;
            while(count < results.length) {
    
                consumeCPU(interval);
    
                long latency = System.nanoTime();
                latency = System.nanoTime() - latency;
    
                results[count++] = latency;
                if (count == WARMUP) {
                    interval = PAUSE;
                }
            }
    
            System.out.println("Results:" + Arrays.toString(Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
        }
    
        public static void main(String[] args) {
            int totalCount = 0;
            while (totalCount < 100) {
                new JvmPauseLatency().run(0);
                totalCount ++;
            }
        }
    }
    

    And results are

    Results:[62, 66, 63, 64, 62, 62, 60, 58, 65, 61, 127, 245, 140, 85, 88, 114, 76, 199, 310, 196]
    Results:[61, 63, 65, 64, 62, 65, 82, 63, 67, 70, 104, 176, 368, 297, 272, 183, 248, 217, 267, 181]
    Results:[62, 65, 60, 59, 54, 64, 63, 71, 48, 59, 202, 74, 400, 247, 215, 184, 380, 258, 266, 323]
    

    In order to fix this benchmark just replace new JvmPauseLatency().run(0) with new JvmPauseLatency().run(PAUSE); and here is the results:

    Results:[46, 45, 44, 45, 48, 46, 43, 72, 50, 47, 46, 44, 54, 45, 43, 43, 43, 48, 46, 43]
    Results:[44, 44, 45, 45, 43, 46, 46, 44, 44, 44, 43, 49, 45, 44, 43, 49, 45, 46, 45, 44]
    

    If you want to change "pause" dynamically - you have to warm-up JVM dynamically, i.e.

        while(count < results.length) {
    
            consumeCPU(interval);
    
            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;
    
            results[count++] = latency;
            if (count >= WARMUP) {
                interval = PAUSE;
            } else {
                interval =  rnd.nextBoolean() ? PAUSE : 0;
            }
        }
    

    Problem N4. What about interpreter -Xint?

    In case of switch-based interpreter we have a lot of problems and the main is indirect branch instructions. I've made 3 experiments:

    1. random warmup
    2. constant warmup with 0 pause
    3. the whole test uses pause 0 including

    Every experiment was started by the following command sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,branch-misses java -Xint JvmPauseLatency and the results are:

     Performance counter stats for 'java -Xint JvmPauseLatency':
    
       272,822,274,275      cycles                                                      
       723,420,125,590      instructions              #    2.65  insn per cycle         
            26,994,494      cache-references                                            
             8,575,746      cache-misses              #   31.769 % of all cache refs    
         2,060,138,555      bus-cycles                                                  
             2,930,155      branch-misses                                               
    
          86.808481183 seconds time elapsed
    
     Performance counter stats for 'java -Xint JvmPauseLatency':
    
         2,812,949,238      cycles                                                      
         7,267,497,946      instructions              #    2.58  insn per cycle         
             6,936,666      cache-references                                            
             1,107,318      cache-misses              #   15.963 % of all cache refs    
            21,410,797      bus-cycles                                                  
               791,441      branch-misses                                               
    
           0.907758181 seconds time elapsed
    
     Performance counter stats for 'java -Xint JvmPauseLatency':
    
           126,157,793      cycles                                                      
           158,845,300      instructions              #    1.26  insn per cycle         
             6,650,471      cache-references                                            
               909,593      cache-misses              #   13.677 % of all cache refs    
             1,635,548      bus-cycles                                                  
               775,564      branch-misses                                               
    
           0.073511817 seconds time elapsed
    

    In case of branch miss latency and footprint grows non-linearly due to huge memory footprint.

    0 讨论(0)
提交回复
热议问题