Latency issues with G1GC

帅比萌擦擦* submitted on 2019-12-03 23:28:51
Sahil Aggarwal

First point:

You need to check whether there is a connection leak in your application.

Beyond that, there is one aspect of G1GC that you can analyze:

Humongous Objects

At this point the majority of the functionality and architecture of G1GC has been fleshed out, with the exception of its biggest weakness and complexity: the Humongous object. As mentioned previously, any single data allocation ≥ G1HeapRegionSize/2 is considered a Humongous object. It is allocated out of contiguous regions of Free space, which are then added to Tenured. Let's run through some basic characteristics and how they affect the normal GC lifecycle. The following discussion will provide insight into the downsides of Humongous objects, such as:

Increase the risk of running out of Free space and triggering a Full GC
Increase overall time spent in STW
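
The size threshold in the definition above can be made concrete with a little arithmetic. The following is only a sketch: the 4 MB region size and the 9 MB object are assumed values for illustration, and the function names are hypothetical helpers, not part of any JDK tool.

```shell
#!/bin/sh
# Sketch: size threshold and region count for G1 humongous allocations,
# following the definition in the text (allocation >= G1HeapRegionSize/2).

# Smallest allocation (in bytes) treated as humongous: half a region.
humongous_threshold() {
    echo $(( $1 / 2 ))
}

# Number of contiguous regions needed for an object of $2 bytes
# when each region is $1 bytes (ceiling division).
regions_needed() {
    echo $(( ($2 + $1 - 1) / $1 ))
}

region=$(( 4 * 1024 * 1024 ))                     # assumed 4 MB regions
humongous_threshold "$region"                     # prints 2097152
regions_needed "$region" $(( 9 * 1024 * 1024 ))   # prints 3
```

With these assumed numbers, a 9 MB array occupies three whole regions, so about 3 MB of the last region cannot be used by anything else.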

Humongous objects are allocated out of Free space. Allocation failures trigger GC events. If an allocation failure from Free space triggers GC, the GC event will be a Full GC, which is very undesirable in most circumstances. To avoid Full GC events in an application with lots of Humongous objects, one must ensure the Free space pool is large enough relative to Eden that Eden always fills up first. One usually ends up being overly cautious, and the application ends up in a state where the Free RAM pool is quite large and never fully utilized, which is by definition wasting RAM.

Humongous objects are freed at the end of an MPCMC

Up until around Oracle JDK 8u45, Humongous objects were only collected at the end of runs of the MPCMC. The release notes for Oracle JDK 8u45 through 8u65 list a few commits indicating that some, but not all, Humongous objects are collected during Minor events.

So you can try updating to the latest Java version.

Humongous objects that are only collectible at the end of an MPCMC will increase the requirements for reserved Free space or be more likely to trigger a Full GC.

Finding how much memory Humongous objects use:

Run the following commands on your gc.log.

Command 1:

grep "source: concurrent humongous allocation" /tmp/gc.log | sed 's/.*allocation request: \([0-9]*\) bytes.*/\1/' > humongous_size.txt

Command 2:

awk -F',' '{sum+=$1} END{print sum;}' humongous_size.txt

This gives you the total size, in bytes, of the humongous allocation requests that appear in your application's log.
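
The two commands can also be combined into a single pass that reports both the number of requests and their total size. This is a sketch that relies on the same two text markers as the commands above; the function name is hypothetical and the sample lines are synthetic, written only to exercise the pipeline.

```shell
#!/bin/sh
# Sketch: count humongous allocation requests and total their sizes in one pass.
# Reads a GC log on stdin.
humongous_summary() {
    grep "source: concurrent humongous allocation" \
      | sed 's/.*allocation request: \([0-9]*\) bytes.*/\1/' \
      | awk '{ n++; sum += $1 } END { printf "count=%d total_bytes=%.0f\n", n, sum }'
}

# Synthetic sample lines (not real log output).
printf '%s\n' \
  'x allocation request: 1048592 bytes, source: concurrent humongous allocation' \
  'x allocation request: 2097168 bytes, source: concurrent humongous allocation' \
  'x unrelated line' \
  | humongous_summary
```

On a real log the call would be `humongous_summary < /tmp/gc.log`.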

Lastly, if your application has memory leaks, you have to fix them.

It's a young collection and almost everything is dying young, so contrary to the comments above, this does not seem to be an issue with the old generation filling up.

[Ext Root Scanning (ms): Min: 0.0, Avg: 140.9, Max: 2478.3, Diff: 2478.3, Sum: 2818.8]

It's basically spending most of the time scanning GC roots, and the other phases are then held up waiting for this phase to finish.
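
To see how lopsided that phase is, the numbers in the log line can be broken down. The sketch below assumes that Sum divided by Avg approximates the number of GC worker threads, and then averages the time of every worker except the slowest one. The function name is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: pull Avg, Max and Sum out of a G1 per-phase line and estimate
# how unevenly the work was spread across GC worker threads.
phase_skew() {
    # $1 is one line such as "[Ext Root Scanning (ms): Min: ..., Sum: ...]"
    echo "$1" | awk '
    {
        gsub(/,/, ""); gsub(/\]/, "")           # drop commas and the closing bracket
        for (i = 1; i < NF; i++) {
            if ($i == "Avg:") avg = $(i + 1)
            if ($i == "Max:") max = $(i + 1)
            if ($i == "Sum:") sum = $(i + 1)
        }
        if (avg <= 0) exit                      # nothing to compute
        workers = int(sum / avg + 0.5)          # assumption: Sum / Avg ~ worker count
        if (workers < 2) exit
        rest = (sum - max) / (workers - 1)      # mean of all workers but the slowest
        printf "workers=%d slowest_ms=%.1f others_avg_ms=%.1f\n", workers, max, rest
    }'
}

phase_skew "[Ext Root Scanning (ms): Min: 0.0, Avg: 140.9, Max: 2478.3, Diff: 2478.3, Sum: 2818.8]"
# prints workers=20 slowest_ms=2478.3 others_avg_ms=17.9
```

Under that assumption, one worker spent about 2.5 seconds scanning roots while the remaining workers averaged under 20 ms each.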

Do you have a lot of threads (you only mention active ones)? Or is your application leaking classes or dynamically generating more and more bytecode?

The application is generating a lot of classes dynamically for each service call, and given the call volume, I suspect those classes might be an issue, but I am not sure how to resolve it.

You first have to figure out whether those generated classes get collected at all during old generation collections. If not, you have a leak and need to fix your application. If they pile up but get collected eventually, you only need to have the old generation collected more frequently, e.g. by decreasing the young generation size (which puts more pressure on the old gen) or by decreasing the IHOP (InitiatingHeapOccupancyPercent).
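
As a rough sketch of how this could be checked, the options below trace class loading and unloading and lower the IHOP. They assume a JDK 8 style HotSpot JVM. The value 30 is an arbitrary illustration (the default is 45), and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: flags for watching class unloading and starting old-gen marking earlier.
# Assumes a JDK 8 style HotSpot JVM; the IHOP value is only an example.

# Build the extra JVM options for a given IHOP percentage.
old_gen_opts() {
    ihop=$1
    echo "-XX:+TraceClassLoading -XX:+TraceClassUnloading -XX:InitiatingHeapOccupancyPercent=$ihop"
}

old_gen_opts 30
# To watch loaded vs. unloaded class counts on a running JVM (pid is a placeholder):
#   jstat -class <pid> 5000
```

If the unloaded count never moves while the loaded count keeps climbing across several old generation collections, that points toward a leak rather than a tuning problem.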

keerthi

If I were in your position, this is what I would do.

  1. Get the GC logs for a couple of days and load them into http://gceasy.io/ to assess how the memory grows.
  2. Temporarily change the garbage collector from G1 to the Parallel collector. I suggest the Parallel collector because it allocates memory in a linear fashion, which makes it relatively easy to check whether you have a memory leak. You also get a good comparison to G1. That doesn't mean you have to move to Parallel permanently; it is just for a temporary comparison.
  3. If the heap is growing continuously in a linear fashion without being garbage collected, then it is definitely a memory leak and you will have to find that.
  4. If you can't see any evidence of memory leak, then you can think about tweaking the garbage collection settings.
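
Steps 1 and 2 could be set up with options along these lines. This is a sketch assuming JDK 8 style logging flags. The log paths are placeholders and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: JVM options for collecting a GC log under either collector.
# Assumes JDK 8 style logging flags; the log path is a placeholder.
gc_opts() {
    case "$1" in
        parallel) collector="-XX:+UseParallelGC" ;;
        g1)       collector="-XX:+UseG1GC" ;;
        *)        echo "unknown collector: $1" >&2; return 1 ;;
    esac
    echo "$collector -Xloggc:$2 -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
}

gc_opts parallel /tmp/gc-parallel.log
gc_opts g1 /tmp/gc-g1.log
```

Writing the two runs to separate log files keeps the comparison between collectors straightforward.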

Tuning the G1 garbage collector to suit your service is very important. G1 without any tuning can perform very badly for some services; ours performed much worse than with the Parallel collector. With specific tuning, it now works better on our server, which has 64 cores and 256 GB RAM.

egorlitvinenko

First of all, a lot of time is spent copying objects between generations.

G1 Evacuation Pause 263 259 ms 560 ms 1 min 8 sec 50 ms 91.61%

According to this

[Eden: 5512.0M(5512.0M)->0.0B(4444.0M) Survivors: 112.0M->128.0M Heap: 8222.2M(12.0G)->2707.5M(12.0G)]

[Object Copy (ms): Min: 0.0, Avg: 41.9, Max: 68.7, Diff: 68.7, Sum: 837.9]

[Update RS (ms): Min: 0.0, Avg: 5.3, Max: 41.9, Diff: 41.9, Sum: 106.9]

Ref Proc: 37.7 ms

[Ext Root Scanning (ms): Min: 0.0, Avg: 140.9, Max: 2478.3, Diff: 2478.3, Sum: 2818.8]

This is all about managing objects between regions. It looks like you have a lot of short-lived objects but spend a lot of time moving objects between regions. Try experimenting with the Young Gen size. On the one hand, you could increase it so that less time is spent copying objects, although this could also increase the time for root scanning. If most of the objects are dead, that would be fine. If not, you should instead decrease the Young Gen size: G1 would then run more frequently but spend less time per collection, which gives more predictable behavior and latency. Since the most time is spent on root scanning, I would start by decreasing the Young Gen size to 3 GB to see what happens.
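
One way to try the 3 GB suggestion without fixing the young generation at an exact size is to cap it as a percentage of the heap. The sketch below assumes the 12 GB heap shown in the log above and uses G1MaxNewSizePercent, which is an experimental flag. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: express a young-generation cap as a percentage of the heap.
# G1MaxNewSizePercent is experimental, hence UnlockExperimentalVMOptions.
young_cap_opts() {
    heap_mb=$1
    young_mb=$2
    pct=$(( young_mb * 100 / heap_mb ))      # integer percentage, rounded down
    echo "-XX:+UnlockExperimentalVMOptions -XX:G1MaxNewSizePercent=$pct"
}

young_cap_opts 12288 3072    # 3 GB of a 12 GB heap
# prints -XX:+UnlockExperimentalVMOptions -XX:G1MaxNewSizePercent=25
```

Setting -Xmn3g would also work, but it pins the young generation to a fixed size, whereas a percentage cap still lets G1 shrink it.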

The second point is:

[Termination (ms): Min: 0.0, Avg: 2282.3, Max: 2415.8, Diff: 2415.8, Sum: 45645.3]

Your G1 spends a lot of time in the termination phase, where a worker thread tries to finish its work, but before doing so it checks all queues and tries to steal tasks. You can see that there are a lot of stealing attempts:

Termination Attempts: Min: 1, Avg: 21.5, Max: 68, Diff: 67, Sum: 430

21.5 attempts per worker thread is not a small number. If a worker thread successfully steals tasks, it continues working. You have a high degree of concurrency here, and I suppose you could decrease the number of GC threads.

P.S. To choose a GC, try them in this order:

  1. Parallel GC, if not appropriate then...
  2. G1, if not appropriate then...
  3. tuned CMS with ParNew.

If I were you, I would use CMS with ParNew to provide good latency for this scenario.

See also understanding G1 logs

I am concerned mainly with this

[Ext Root Scanning (ms): Min: 0.0, Avg: 140.9, Max: 2478.3, Diff: 2478.3, Sum: 2818.8]

Someone in this post asked if you were generating a large number of dynamic classes. If you are, that could explain why Ext Root Scanning takes so long.

On the other hand, resource contention could be another reason. You say you are using M4-2X Large EC2 boxes. According to https://aws.amazon.com/ec2/instance-types/, this machine has 8 virtual cores.

You set the number of GC workers to 20 when there are only 8 cores. As a result, chances are there is contention even to schedule the GC workers. Other processes on the system might also be competing for CPU. A GC worker thread might therefore be scheduled late, causing the Root Scanning phase to appear long.

It would also cause the termination phase to be long, because the other threads finish first and then wait.

You spoke about increasing the GC worker threads; setting the count to 8 or fewer, maybe even 6, might help. It's worth a shot.
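
A minimal sketch of that adjustment: cap the requested worker count at the number of cores and pass it through -XX:ParallelGCThreads. The function name is hypothetical, and the core count is read with getconf, which is assumed to be available on the host.

```shell
#!/bin/sh
# Sketch: cap the number of stop-the-world GC worker threads at the core count.
gc_thread_opts() {
    wanted=$1
    cores=$2
    if [ "$wanted" -gt "$cores" ]; then
        wanted=$cores
    fi
    echo "-XX:ParallelGCThreads=$wanted"
}

cores=$(getconf _NPROCESSORS_ONLN)     # online CPUs as seen by the OS
gc_thread_opts 20 "$cores"
gc_thread_opts 20 8                    # the 8-core case from the text: prints -XX:ParallelGCThreads=8
```

Going below the core count, for example 6, leaves some CPU for whatever else runs on the box.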

I see that this post was asked a long time ago. If you did manage to solve it, I would be interested to know what you did.
