Running with JVM:
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)
CentOS release 6.4 (Final)
JVM options:
-Xmx4g -Xms4g -XX:MaxPermSize=4g -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintClassHistogram -XX:+CMSClassUnloadingEnabled -verbose:gc -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+DisableExplicitGC
Running in an OSGi environment, with Aerospike DB and Netty (NIO) for networking.
I ran a weekend longevity test. This was the last line printed before the freeze:
[2015-12-11 09:54:51,185] INFO : [GC pause (young)
After 2 days I ran strace on the PID, and these lines were printed next:
[2015-12-11 09:54:51,185] INFO : [GC pause (young) 3598M->1458M(4096M), 0.0280020 secs]
[2015-12-13 11:54:54,353] INFO : [GC pause (young) 3598M->1464M(4096M), 180001.5628870 secs]
The first print completed, and the next one reported a GC pause of about 2 days (180001 seconds).
The JVM did not respond to thread dump signals during the freeze (pkill -QUIT pid). The freeze happens every few days, and it occurs with the CMS collector as well as with G1. How can I start debugging this, and what could be causing it?
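As a sanity check on the two log lines above, here is a minimal sketch that compares the wall-clock gap between the two log timestamps with the pause the JVM itself reported. The class and method names (GcPauseCheck, wallClockMillis, reportedPauseSeconds) are made up for illustration, it uses java.time (so it needs Java 8+ and would be run separately, not on the 1.7 JVM above), and it assumes both timestamps are in the same local time zone with no DST change in between.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class GcPauseCheck {
    // Timestamp layout used by the logger in the lines above, e.g. 2015-12-11 09:54:51,185
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    // Matches the trailing "<seconds> secs]" part of a -verbose:gc line.
    private static final Pattern PAUSE = Pattern.compile("([0-9]+\\.[0-9]+) secs\\]");

    // Wall-clock time between two log timestamps, in milliseconds.
    static long wallClockMillis(String first, String second) {
        LocalDateTime a = LocalDateTime.parse(first, LOG_TS);
        LocalDateTime b = LocalDateTime.parse(second, LOG_TS);
        return Duration.between(a, b).toMillis();
    }

    // Pause duration as reported by the JVM in the GC line, in seconds.
    static double reportedPauseSeconds(String gcLine) {
        Matcher m = PAUSE.matcher(gcLine);
        if (!m.find()) {
            throw new IllegalArgumentException("no pause duration in: " + gcLine);
        }
        return Double.parseDouble(m.group(1));
    }

    public static void main(String[] args) {
        long gapMs = wallClockMillis("2015-12-11 09:54:51,185", "2015-12-13 11:54:54,353");
        double pause = reportedPauseSeconds(
                "[GC pause (young) 3598M->1464M(4096M), 180001.5628870 secs]");
        System.out.println("gap between log lines (s): " + gapMs / 1000.0);
        System.out.println("pause reported by the JVM (s): " + pause);
    }
}
```

The two numbers agree to within a couple of seconds, which suggests the whole gap was spent inside that one pause rather than the logging simply being stalled.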
Thank you.
EDIT: Had another freeze; this time running strace did not release it. The second freeze was released by running jstack.
UPDATE: Found the problem! Look at the answer below.
I found the problem!
It is a kernel bug in futex_wait() that was backported to our kernel version.
You can read about it here: