Question
We have two CDH clusters running the same version (CDH-5.5.2-1.cdh5.5.2.p0.4), and the ResourceManager in each cluster has the same configuration.
One ResourceManager runs fine: its heap usage stays roughly constant (e.g. 800 MB) over time.
The other one throws an OOM exception and exits after about 15 days. When we use 'jmap -F -histo' to dump its JVM heap histogram, we see that the total size of the 'char[]' objects keeps growing over time until the process finally runs out of memory.
The key information from the JVM dumps of both the good RM and the OOM RM follows:
Dump command: jmap -F -histo pid
A) JVM dump of the good RM in cluster A [we see 400,000+ char[] instances using 60 MB+ of heap][1]
B) JVM dump of the bad RM (OOM) in cluster B [we see 300,000+ char[] instances, but using 400 MB+ of heap][2]
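To make the comparison between the two histograms concrete, here is a minimal Python sketch that pulls the char[] row out of `jmap -histo` text and computes the average size per instance. The function name `char_array_stats` is hypothetical, the row pattern is an assumption about the histogram layout (rank, instance count, byte count, class name), and the sample rows use the rounded figures quoted above rather than the actual dump output.

```python
import re

# Assumed row layout of a jmap histogram: "<rank>: <#instances> <#bytes> <class name>".
# Columns may be separated by spaces or tabs depending on whether -F was used.
HISTO_ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S.*?)\s*$")

# char[] is printed as "[C" by plain jmap -histo and as "char[]" by jmap -F -histo.
CHAR_ARRAY_NAMES = ("[C", "char[]")


def char_array_stats(histo_text):
    """Return (instances, total_bytes, avg_bytes) for char[], or None if absent."""
    for line in histo_text.splitlines():
        match = HISTO_ROW.match(line)
        if match and match.group(3) in CHAR_ARRAY_NAMES:
            instances = int(match.group(1))
            total_bytes = int(match.group(2))
            return instances, total_bytes, total_bytes / instances
    return None


if __name__ == "__main__":
    # Illustrative rows only: rounded figures from the question, not real dump lines.
    samples = {
        "cluster A (good RM)": "1:\t\t400000\t60000000\tchar[]",
        "cluster B (OOM RM)": "1:\t\t300000\t400000000\tchar[]",
    }
    for name, text in samples.items():
        instances, total_bytes, avg = char_array_stats(text)
        print(f"{name}: {instances} instances, {total_bytes} bytes, {avg:.0f} bytes each")
```

With those rounded figures, the average char[] in cluster B is roughly nine times larger than in cluster A, which suggests that a smaller number of much longer strings is being retained, not a larger number of strings.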
Any help will be appreciated.
Today we took a heap dump (jmap -F -dump:file=file.dump_result pid) and analyzed the dump file with MAT (Memory Analyzer Tool). We found that the instance variable applications (a java.util.concurrent.ConcurrentHashMap) in org.apache.hadoop.yarn.server.resourcemanager.RMActiveServiceContext consumes a large share of the heap:
Call hierarchy information
instance variable: applications
Source: https://stackoverflow.com/questions/40861974/resourcemanager-memory-leak