I'm trying to avoid the Full GC (from the gc.log sample below) when running a Grails application in Tomcat in production. Any suggestions on how to better configure the GC?
Below are my settings for a 4-core Linux box.
In my experience, you can tune -XX:NewSize, -XX:MaxNewSize and -XX:GCTimeRatio to achieve high throughput and low latency.
-server
-Xms2048m
-Xmx2048m
-Dsun.rmi.dgc.client.gcInterval=86400000
-Dsun.rmi.dgc.server.gcInterval=86400000
-XX:+AggressiveOpts
-XX:GCTimeRatio=20
-XX:+UseParNewGC
-XX:ParallelGCThreads=4
-XX:+CMSParallelRemarkEnabled
-XX:ParallelCMSThreads=2
-XX:+CMSScavengeBeforeRemark
-XX:+UseConcMarkSweepGC
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=50
-XX:NewSize=512m
-XX:MaxNewSize=512m
-XX:PermSize=256m
-XX:MaxPermSize=256m
-XX:SurvivorRatio=90
-XX:TargetSurvivorRatio=90
-XX:MaxTenuringThreshold=15
-XX:MaxGCMinorPauseMillis=1
-XX:MaxGCPauseMillis=5
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-Xloggc:./logs/gc.log
The log snippet posted shows you have a substantial number of objects that are live for >320s (approx 40s per young collection and objects survive through 8 collections before promotion). The remaining objects then bleed into tenured and eventually you hit an apparently unexpected full gc which doesn't actually collect very much.
3453285K->3099828K(4194304K)
i.e. you have a 4G tenured which is ~82% full (3453285/4194304) when it is triggered and is ~74% full after 13 long seconds.
This means it took 13s to collect the grand total of ~350M which, in the context of a 6G heap, is not very much.
This basically means your heap is not big enough or, perhaps more likely, you have a memory leak. A leak like this is a terrible thing for CMS, because a concurrent tenured collection is a non-compacting event: tenured is a collection of free lists, so fragmentation can be a big problem for CMS, and your utilisation of tenured becomes increasingly inefficient. That in turn increases the probability of promotion failure events (though if this were such an event then I'd expect to see a log message saying so): CMS wants to promote (or thinks it will need to promote) X MB into tenured but does not have a (contiguous) free list >= X MB available. This triggers an unexpected tenured collection, which is a not remotely concurrent STW event. If you actually have very little to collect (as you do) then there is no surprise you're sitting twiddling your thumbs.
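If you want to confirm fragmentation / promotion failure before changing anything else, two HotSpot flags can be added next to your existing GC logging (a minimal sketch; as far as I know both are available in recent 6u/7u builds and only affect logging):

-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1

The first logs the size of the allocation that failed to promote; the second prints CMS free-list statistics (total free space and largest free block) around collections, which shows directly whether tenured is fragmenting.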
Some general pointers about the ParNew collection, to a large extent reiterating what Vladimir Sitnitov has said...
Some questions...
How many CPUs are available to Tomcat? 4?
What Java version are you using? (>1.6.0u23?)
0) From the Full GC output, it definitely looks like you are hitting a memory limit: even after a full GC, there is still 3099828K of used memory (out of 4194304K). There is just no way to prevent a Full GC when you are out of memory.
Is a 3.1Gb working set expected for your application? That is 3.1Gb of non-garbage memory!
If that is expected, it is time to increase -Xmx/-Xms. Otherwise, it is time to collect and analyze a heap dump to identify the memory hog.
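A minimal way to grab such a dump (assuming a HotSpot JDK and that <pid> is your Tomcat process id; jmap ships with the JDK):

jmap -dump:live,format=b,file=heap.hprof <pid>

Open heap.hprof in Eclipse MAT or VisualVM and look at the dominator tree / retained sizes; the live option forces a full GC first, so the dump contains only reachable objects.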
After you solve the problem of the 3Gb working set, you may find the following advice useful. From my point of view, regular (non-incremental) CMS mode and a reduced NewSize are worth trying (a combined flag sketch follows after point 4).
1) Incremental mode is targeted at single-CPU machines, where the CMS thread yields the CPU to other threads.
If you have some spare CPU (e.g. you are running a multicore machine), it is better to perform GC in the background without yielding.
Thus I would recommend removing -XX:+CMSIncrementalMode.
2) -XX:CMSInitiatingOccupancyFraction=60 tells CMS to start a background GC once the old gen is 60% full.
If there is garbage in the heap and CMS does not keep up with it, it makes sense to lower CMSInitiatingOccupancyFraction. For instance, with -XX:CMSInitiatingOccupancyFraction=30, CMS would start a concurrent collection when the old gen is 30% full. Currently it is hard to tell whether that is the case, since you just do not have garbage in the heap.
3) It looks like "extended tenuring" does not help -- the objects just do not die out even after 7-8 tenurings. I would recommend reducing SurvivorRatio (e.g. SurvivorRatio=2, or just remove the option and stick with the default). That would reduce the number of tenurings, resulting in shorter minor GC pauses.
4) -XX:NewSize=2G. Did you try lower values for NewSize? Say, NewSize=512m. That should reduce minor GC pauses and make young->old promotions less massive, simplifying the work for CMS.
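Putting points 1-4 together, a starting point could look roughly like this, with -XX:+CMSIncrementalMode removed (a sketch, not a tested configuration -- the exact values depend on your workload and need to be checked against the GC log):

-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=30
-XX:NewSize=512m
-XX:MaxNewSize=512m
-XX:SurvivorRatio=2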
I'm serving requests and expect that, beyond a certain amount of shared objects, every other object is useful only to the request at hand. That's the theory, but any kind of cache can easily void that assumption and create objects that live beyond the request.
As others have noted, neither your huge young generation nor the extended tenuring seems to work.
You should profile your application and analyze the age-distribution of objects. I'm pretty sure Grails caches all kinds of things beyond the scope of a request and that's what leaks into the old gen.
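A quick way to see what accumulates between requests (assuming a HotSpot JDK; <pid> is the Tomcat process id) is to diff two class histograms taken a few minutes apart:

jmap -histo:live <pid> > histo-1.txt
jmap -histo:live <pid> > histo-2.txt

Classes whose instance counts keep growing from one histogram to the next are good candidates for cached or leaked state.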
What you're essentially trying to do is sacrifice the young generation pause times (for a young gen of 2GB) to postpone the inevitable - an old gen collection of 6GB. This is not exactly a good tradeoff you're making there.
Instead you should probably aim for better young gen pause times and allow CMS to burn more CPU time: more concurrent-phase GC threads (can't remember the option for that one), a higher GCTimeRatio, and a MaxGCPauseMillis > MaxGCMinorPauseMillis to take pressure off the minor collections and allow them to meet their pause goals instead of having to resize to fit the major collection limit.
To make major GCs less painful you might want to read this: http://blog.ragozin.info/2012/03/secret-hotspot-option-improving-gc.html (this patch should be in j7u4). CMSParallelRemarkEnabled should be enabled too; not sure if this is the default.
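For illustration only -- the numbers below are placeholders, and the concurrent-phase thread count is, as far as I know, what -XX:ParallelCMSThreads (already in the list above) controls -- that advice could translate into something like:

-XX:ParallelCMSThreads=4
-XX:GCTimeRatio=50
-XX:MaxGCPauseMillis=200
-XX:MaxGCMinorPauseMillis=50
-XX:+CMSParallelRemarkEnabled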
Personally, I have had some horrible experiences with G1GC working itself into a corner due to some very large LRU-like workloads and then falling back to a large, stop-the-world collection far more often than CMS experienced concurrent mode failures for the same workload.
But for other workloads (like yours) it might actually do the job and collect the old generation incrementally, while also compacting and thus avoiding any big pauses.
Give it a try if you haven't already. Again, update to the newest Java 7 before you do so; G1 still has some issues with its heuristics that they're trying to iron out.
Edit: Oracle has improved G1GC's heuristics and addressed some bottlenecks since I wrote this answer. It should definitely be worth a try now.
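If you want to experiment with it, the switch itself is small (a sketch; the pause target is a placeholder, and G1 generally works best when you do not also pin the young generation size or survivor ratio):

-XX:+UseG1GC
-XX:MaxGCPauseMillis=200

and drop the CMS / ParNew / NewSize / SurvivorRatio flags from the list above.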
As you're already using a parallel collector for a 2GB young gen and getting away with 200ms pause times... why not try the parallel old gen collector on your 6G heap? It would probably take less than the 10s+ major collections you're seeing with CMS. Whenever CMS runs into one of its failure modes it does a single-threaded, stop-the-world collection.
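A sketch of that alternative, replacing the CMS / ParNew flags (-XX:ParallelGCThreads is already in your list):

-XX:+UseParallelGC
-XX:+UseParallelOldGC
-XX:ParallelGCThreads=4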
Your survivor sizes aren't decreasing much, if at all - ideally they should be decreasing steeply, because you only want a minority of objects to survive long enough to reach the Old generation.
This suggests that many objects are living a relatively long time - which can happen when you have many open connections, threads, etc. that are not handled quickly, for example.
(Do you have any options to change the application, incidentally, or can you only modify the GC settings? There might also be Tomcat settings that would have an effect...)