Why does the JVM show more latency for the same block of code after a busy spin pause?

前端未结


The code below demonstrates the problem unequivocally, which is:

The exact same block of code becomes slower after a busy spin pause.


2 answers

臣服心动

2021-02-08 14:15

You probably cannot rely on any timer for the accuracy you seem to want. The documentation for System.nanoTime() (https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime--) states that:

This method provides nanosecond precision, but not necessarily nanosecond resolution (that is, how frequently the value changes) - no guarantees are made except that the resolution is at least as good as that of currentTimeMillis().
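To make the precision-versus-resolution distinction concrete, here is a minimal sketch that estimates how finely System.nanoTime() actually ticks on a given machine. The class and method names are hypothetical, and the number it reports depends on the OS, hardware and JVM, so treat it as a rough probe rather than a specification.

```java
final class NanoTimeGranularity {

    /** Smallest positive difference seen between two consecutive clock readings. */
    static long minObservedStep(int samples) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            long now;
            // Spin until the reported value visibly advances.
            do {
                now = System.nanoTime();
            } while (now <= start);
            min = Math.min(min, now - start);
        }
        return min;
    }

    public static void main(String[] args) {
        minObservedStep(100_000); // warm up so the probe itself gets compiled
        System.out.println("smallest observed step: " + minObservedStep(1_000_000) + " ns");
    }
}
```

If the smallest observed step is in the same range as the latencies being measured, individual samples say little on their own.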


南旧

2021-02-08 14:22

TL;DR

http://www.brendangregg.com/activebenchmarking.html

casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.

Problem N1. The very first measurement after the pause change.

It looks like you are running into on-stack replacement (OSR). When OSR occurs, the VM is paused, and the stack frame for the target function is replaced by an equivalent frame.

The root cause is a flawed microbenchmark: it was not properly warmed up. To fix it, insert the following line into your benchmark before the while loop:

System.out.println("WARMUP = " + busyPause(5000000000L));

To check this, run your benchmark with -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls. I've modified your code so that it prints the interval to standard output before every call:

interval = 1
interval = 1
interval = 5000000000
    689  145       4       JvmPauseLatency::busyPause (19 bytes)   made not entrant
    689  146       3       JvmPauseLatency::busyPause (19 bytes)
Installing method (3) JvmPauseLatency.busyPause(J)J
    698  147 %     4       JvmPauseLatency::busyPause @ 6 (19 bytes)
Installing osr method (4) JvmPauseLatency.busyPause(J)J @ 6
    702  148       4       JvmPauseLatency::busyPause (19 bytes)
    705  146       3       JvmPauseLatency::busyPause (19 bytes)   made not entrant
Installing method (4) JvmPauseLatency.busyPause(J)J
interval = 5000000000
interval = 5000000000
interval = 5000000000
interval = 5000000000

OSR usually occurs at tier 4, so to disable it you can use one of the following sets of options:

-XX:-TieredCompilation (disables tiered compilation)

-XX:+TieredCompilation -XX:TieredStopAtLevel=3 (stops tiered compilation at level 3, so tier 4 is never reached)

-XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement (disables OSR)
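To observe OSR in isolation and try these options, a tiny program with one long-running loop in a method that is called only once is enough. The sketch below is an illustration with hypothetical names, not the code from the question; the arithmetic in the loop exists only to give the JIT something it cannot trivially remove.

```java
final class OsrDemo {

    // Invoked a single time. The loop runs long enough that the JIT can only
    // speed it up by swapping in compiled code while the method is executing.
    static long spin(long iterations) {
        long acc = 0;
        for (long i = 0; i < iterations; i++) {
            acc += i ^ (acc >>> 3);
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(spin(200_000_000L));
    }
}
```

Running it with -XX:+PrintCompilation should show a line for OsrDemo::spin marked with %, the same marker that appears in the busyPause log above. Adding -XX:-UseOnStackReplacement is expected to make that line disappear, and the single call will then likely run noticeably slower.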

Problem N2. How to measure.

Let's start with the article https://shipilev.net/blog/2014/nanotrusting-nanotime. In short:

The JIT compiles whole methods. Your test consists of a single loop, so OSR is the only way it can switch to compiled code.

You are trying to measure something small, maybe smaller than the nanoTime() call itself (see What is the cost of volatile write?).

Microarchitecture-level effects such as caches and CPU pipeline stalls matter: for example, a TLB miss or a branch misprediction can take longer than the code under test.

To avoid all these pitfalls you can use a JMH-based benchmark like this:
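The second point can be checked directly by timing nanoTime() against itself. The following is a rough sketch with hypothetical names; it reports an average over many calls and ignores everything a proper harness would control for, so the figure is only an order-of-magnitude floor for what a nanoTime()-based measurement can resolve.

```java
final class NanoTimeCost {

    /** Average wall-clock cost of one System.nanoTime() call, in nanoseconds. */
    static double averageCallCostNanos(int calls) {
        long sink = 0;
        long begin = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += System.nanoTime(); // use the result so the call cannot be dropped
        }
        long end = System.nanoTime();
        if (sink == 42) {
            System.out.println(sink); // practically never true; keeps 'sink' live
        }
        return (end - begin) / (double) calls;
    }

    public static void main(String[] args) {
        averageCallCostNanos(1_000_000); // warm-up pass
        System.out.printf("about %.1f ns per nanoTime() call%n", averageCallCostNanos(10_000_000));
    }
}
```

Any code block whose run time is close to this figure is being measured mostly as timer overhead.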

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2)
public class LatencyTest {

    public static final long LONG_PAUSE = 5000L;
    public static final long SHORT_PAUSE = 1L;
    public Random rand;

    @Setup
    public void initI() {
        rand = new Random(0xDEAD_BEEF);
    }

    private long busyPause(long pauseInNanos) {
        Blackhole.consumeCPU(pauseInNanos);
        return pauseInNanos;
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseShort() {
        return busyPause(SHORT_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseLong() {
        return busyPause(LONG_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyFunc() {
        return doCalculation(1);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureShort() {
        long x = busyPause(SHORT_PAUSE);
        return doCalculation(x);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureLong() {
        long x = busyPause(LONG_PAUSE);
        return doCalculation(x);
    }

    private long doCalculation(long x) {
        long calculation = 0;
        calculation += x / (rand.nextInt(5) + 1);
        calculation -= calculation / (rand.nextInt(5) + 1);
        calculation -= x / (rand.nextInt(6) + 1);
        calculation += calculation / (rand.nextInt(6) + 1);
        return calculation;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LatencyTest.class.getName())
                .verbosity(VerboseMode.NORMAL)
                .build();
        new Runner(options).run();
    }
}

Please note that I've changed the busy-loop implementation to Blackhole#consumeCPU() in order to avoid OS-related effects. My results are:

Benchmark                          Mode  Cnt      Score     Error  Units
LatencyTest.latencyBusyPauseLong   avgt    4  15992.216 ± 106.538  ns/op
LatencyTest.latencyBusyPauseShort  avgt    4      6.450 ±   0.163  ns/op
LatencyTest.latencyFunc            avgt    4     97.321 ±   0.984  ns/op
LatencyTest.measureLong            avgt    4  16103.228 ± 102.338  ns/op
LatencyTest.measureShort           avgt    4    100.454 ±   0.041  ns/op

Please note that the results are almost additive, i.e. latencyFunc + latencyBusyPauseShort ≈ measureShort (97.321 + 6.450 = 103.771 ns/op against a measured 100.454 ns/op).

Problem N3. The discrepancy is big.

What is wrong with your test? It does not warm up the JVM properly: it uses one parameter for the warmup and another for the measurement. Why is this important? The JVM uses profile-guided optimization; for example, it counts how often each branch is taken and generates "the best" (branch-free) code for that particular profile. So when we warm up the benchmark with parameter 1, the JVM generates "optimal" code for a profile in which the branch in the while loop has never been taken. Here is an event from the JIT compilation log (-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation):

<branch prob="0.0408393" not_taken="40960" taken="1744" cnt="42704" target_bci="42"/>

After the parameter changes, the compiled code hits an uncommon trap in order to process your code, which is not optimal. I've created a benchmark based on the original one with minor changes:

busyPause is replaced by consumeCPU from JMH, to get a pure Java benchmark without interaction with the operating system (nanoTime actually goes through the userland vDSO function clock_gettime, and we are unable to profile that code)

all calculations are removed


import java.util.Arrays;

public class JvmPauseLatency {

    private static final int WARMUP = 2000;
    private static final int EXTRA = 10;
    private static final long PAUSE = 70000L; // in nanos
    private static volatile long consumedCPU = System.nanoTime();

    //org.openjdk.jmh.infra.Blackhole.consumeCPU()
    private static void consumeCPU(long tokens) {
        long t = consumedCPU;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        if (t == 42) {
            consumedCPU += t;
        }
    }

    public void run(long warmPause) {
        long[] results = new long[WARMUP + EXTRA];
        int count = 0;
        long interval = warmPause;
        while (count < results.length) {
            consumeCPU(interval);
            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;
            results[count++] = latency;
            if (count == WARMUP) {
                interval = PAUSE;
            }
        }
        System.out.println("Results:" + Arrays.toString(Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
    }

    public static void main(String[] args) {
        int totalCount = 0;
        while (totalCount < 100) {
            new JvmPauseLatency().run(0);
            totalCount++;
        }
    }
}

And the results are:

Results:[62, 66, 63, 64, 62, 62, 60, 58, 65, 61, 127, 245, 140, 85, 88, 114, 76, 199, 310, 196]
Results:[61, 63, 65, 64, 62, 65, 82, 63, 67, 70, 104, 176, 368, 297, 272, 183, 248, 217, 267, 181]
Results:[62, 65, 60, 59, 54, 64, 63, 71, 48, 59, 202, 74, 400, 247, 215, 184, 380, 258, 266, 323]
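Each Results line holds the last 10 warmup samples followed by the 10 samples taken after the interval switched to PAUSE. To put a number on the jump, the sketch below computes the median of each half for the first line. The helper class is hypothetical; the sample values are copied from that line.

```java
import java.util.Arrays;

final class ResultSummary {

    /** Median of the given samples; the input array is left unmodified. */
    static double median(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        long[] firstRun = {62, 66, 63, 64, 62, 62, 60, 58, 65, 61,
                           127, 245, 140, 85, 88, 114, 76, 199, 310, 196};
        double before = median(Arrays.copyOfRange(firstRun, 0, 10));
        double after = median(Arrays.copyOfRange(firstRun, 10, 20));
        // prints: median before = 62.0 ns, after = 133.5 ns
        System.out.println("median before = " + before + " ns, after = " + after + " ns");
    }
}
```

For that run the median latency of the empty nanoTime() pair roughly doubles once the pause changes, and the spread grows as well.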

To fix this benchmark, just replace new JvmPauseLatency().run(0) with new JvmPauseLatency().run(PAUSE); here are the results:

Results:[46, 45, 44, 45, 48, 46, 43, 72, 50, 47, 46, 44, 54, 45, 43, 43, 43, 48, 46, 43]
Results:[44, 44, 45, 45, 43, 46, 46, 44, 44, 44, 43, 49, 45, 44, 43, 49, 45, 46, 45, 44]

If you want to change the pause dynamically, you have to warm up the JVM dynamically as well, i.e.:

// rnd is assumed to be a java.util.Random instance, e.g. a field of JvmPauseLatency
while (count < results.length) {
    consumeCPU(interval);
    long latency = System.nanoTime();
    latency = System.nanoTime() - latency;
    results[count++] = latency;
    if (count >= WARMUP) {
        interval = PAUSE;
    } else {
        interval = rnd.nextBoolean() ? PAUSE : 0;
    }
}
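The interval selection in that loop can be pulled out into a small helper, which makes the warmup schedule easy to check on its own. This is a sketch only: the class name, the method signature and the seed are assumptions, and PAUSE simply mirrors the constant used in the benchmark.

```java
import java.util.Random;

final class MixedWarmup {

    static final long PAUSE = 70000L; // same value as in the benchmark above

    /**
     * During warmup the interval alternates randomly between 0 and PAUSE so the
     * JIT profiles both paths; from 'warmup' onwards it is always PAUSE.
     */
    static long nextInterval(int count, int warmup, Random rnd) {
        if (count >= warmup) {
            return PAUSE;
        }
        return rnd.nextBoolean() ? PAUSE : 0L;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed (an assumption) for repeatable runs
        for (int count = 1995; count < 2005; count++) {
            System.out.println(count + " -> " + nextInterval(count, 2000, rnd));
        }
    }
}
```

The point of the mixed schedule is that the branch profile collected during warmup already includes the long pause, so switching to it later should not invalidate the compiled code.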

Problem N4. What about the interpreter (-Xint)?

With a switch-based interpreter we have a lot of problems, and the main one is indirect branch instructions. I ran 3 experiments:

random warmup

constant warmup with 0 pause

the whole test uses pause 0, including the warmup

Every experiment was started with the following command:

sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,branch-misses java -Xint JvmPauseLatency

and the results are:

 Performance counter stats for 'java -Xint JvmPauseLatency':

   272,822,274,275      cycles
   723,420,125,590      instructions              #    2.65  insn per cycle
        26,994,494      cache-references
         8,575,746      cache-misses              #   31.769 % of all cache refs
     2,060,138,555      bus-cycles
         2,930,155      branch-misses

      86.808481183 seconds time elapsed

 Performance counter stats for 'java -Xint JvmPauseLatency':

     2,812,949,238      cycles
     7,267,497,946      instructions              #    2.58  insn per cycle
         6,936,666      cache-references
         1,107,318      cache-misses              #   15.963 % of all cache refs
        21,410,797      bus-cycles
           791,441      branch-misses

       0.907758181 seconds time elapsed

 Performance counter stats for 'java -Xint JvmPauseLatency':

       126,157,793      cycles
       158,845,300      instructions              #    1.26  insn per cycle
         6,650,471      cache-references
           909,593      cache-misses              #   13.677 % of all cache refs
         1,635,548      bus-cycles
           775,564      branch-misses

       0.073511817 seconds time elapsed

With branch misses, latency and footprint grow non-linearly because of the huge memory footprint.
