Question
I'm testing a Jetty-based API against a Netty-based one. With the only difference in the experiment being which API I use (same application, same servers, same memory configuration, same load, and so on), I get longer GC pauses with the Netty-based one. Mostly, pauses are below a millisecond, but after a few days of running smoothly, every 12-24 hours I'll see a 4-6 second pause that does not show up with the Jetty-based API.
Whenever this happens, there is very little information about what G1 was doing that caused it to issue a stop-the-world pause; note the second pause message here:
2016-02-23T05:22:27.709+0000: 66360.282: Total time for which application threads were stopped: 0.0319639 seconds, Stopping threads took: 0.0000716 seconds
2016-02-23T05:22:35.642+0000: 66368.215: Total time for which application threads were stopped: 6.9705594 seconds, Stopping threads took: 0.0000737 seconds
2016-02-23T05:22:35.673+0000: 66368.246: Total time for which application threads were stopped: 0.0048374 seconds, Stopping threads took: 0.0040574 seconds
My GC options are:
-XX:+UseG1GC
-XX:+G1SummarizeConcMark
-XX:+G1SummarizeRSetStats
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGC
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+DisableExplicitGC
-XX:InitialHeapSize=12884901888
-XX:MaxHeapSize=12884901888
And, for reference, my VM options are:
-XX:+AlwaysPreTouch
-XX:+DebugNonSafepoints
-XX:+FlightRecorder
-XX:FlightRecorderOptions=stackdepth=500
-XX:-OmitStackTraceInFastThrow
-XX:+TrustFinalNonStaticFields
-XX:+UnlockCommercialFeatures
-XX:+UnlockDiagnosticVMOptions
-XX:+UnlockExperimentalVMOptions
-XX:+UseCompressedClassPointers
-XX:+UseCompressedOops
How do I find out why G1 stopped the world at 2016-02-23T05:22:35.642?
Answer 1:
Not all STW pauses are caused by the GC; the mechanism used to trigger them is called a safepoint. Use -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
to print the causes of the other safepoints.
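For example (a sketch only; app.jar stands in for your own launch command), the flags would be added like this. With PrintSafepointStatisticsCount=1 the JVM prints a statistics line after every safepoint, naming the VM operation that triggered it (e.g. no vm operation, RevokeBias, G1IncCollectionPause), so a non-GC cause for the 6.9-second stop would show up there. On JDK 9 and later these flags were superseded by unified logging via -Xlog:safepoint.

java -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -jar app.jar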
Secondly, if the pauses are caused by the GC, then the lines you pasted do not themselves contain the cause; an adjacent block in the GC log should, something like [GC pause (G1 Evacuation Pause) (young), 0.0200285 secs]
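A quick way to locate that adjacent block (assuming the log goes to a file, here called gc.log as a placeholder) is to search around the date stamp of the long pause:

grep -B 10 "2016-02-23T05:22:35" gc.log

If the GC was responsible, the lines just before the "Total time for which application threads were stopped" message will name the pause type.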
Additionally, you may want to monitor disk IO latency and match its timestamps against the safepoint pauses. Any synchronous IO or paging to slow storage that happens during a safepoint can stall the entire safepoint. Putting log files and /tmp
on a tmpfs or SSDs may help there.
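As a sketch (the size is an assumption; pick one that fits your instance), mounting /tmp as tmpfs via an /etc/fstab entry could look like:

tmpfs  /tmp  tmpfs  defaults,noatime,size=2g  0  0

followed by mount /tmp (or a reboot) to activate it.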
Answer 2:
To add some closure to this: the problem was not, technically, a GC pause; it was a combination of several factors:
- AWS throttles IO to what you've paid for
- /tmp on Ubuntu by default ended up on our (throttled) EBS volume
- the JVM by default writes to /tmp during stop-the-world(!)
Other parts of our application reached the EBS throttling threshold, and when the JVM tried to write to /tmp during a STW pause, every thread in the JVM queued up behind the AWS throttling point.
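Concretely, the file in question is the hsperfdata performance-counter file that HotSpot memory-maps under /tmp; you can see it for a running JVM (the user name and pid vary) with:

ls -l /tmp/hsperfdata_$USER/

Because the file is memory-mapped, the kernel can write its dirty pages back to the underlying (throttled) EBS volume at any moment, including in the middle of a safepoint, which is what stalled the pause.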
It seems the Netty/Jetty difference was a red herring.
We need our application to survive in this kind of environment, so our solution was to disable this JVM behavior, at the cost of losing support for several JVM tools we had added:
-XX:+PerfDisableSharedMem
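Note that disabling the shared perf memory also hides the process from the tools that read the hsperfdata file; for example (a quick check, not from the original post):

jps -l

will no longer list a JVM started with -XX:+PerfDisableSharedMem, and jstat is affected the same way.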
More info on this issue in this excellent blog post: http://www.evanjones.ca/jvm-mmap-pause.html
Source: https://stackoverflow.com/questions/35618747/how-do-i-get-g1-to-print-more-log-details