Elasticsearch: restart node after java.lang.OutOfMemoryError: Java heap space

Submitted by ぃ、小莉子 on 2020-12-08 07:59:10

Question


One of my ES nodes failed with a java.lang.OutOfMemoryError: Java heap space error. Here is the full stack trace from the logs:

    [2020-09-18T04:25:04,215][WARN ][o.e.a.b.TransportShardBulkAction] [search1] [[my_index_4][0]] failed to perform indices:data/write/bulk[s] on replica [my_index_4][0], node[cm_76wfGRFm9nbPR1mJxTQ], [R], s[STARTED], a[id=BUpviwHxQK2qC3GrELC2Hw]
    org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected
    [2020-09-18T04:25:04,215][WARN ][o.e.c.a.s.ShardStateAction] [search1] [my_index_4][0] received shard failed for shard id [[my_index_4][0]], allocation id [BUpviwHxQK2qC3GrELC2Hw], primary term [2], message [failed to perform indices:data/write/bulk[s] on replica [my_index_4][0], node[cm_76wfGRFm9nbPR1mJxTQ], [R], s[STARTED], a[id=BUpviwHxQK2qC3GrELC2Hw]], failure [NodeDisconnectedException[[search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected]]
    org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected
    [2020-09-18T04:25:04,215][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [search1] failed to execute on node [cm_76wfGRFm9nbPR1mJxTQ]
    org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][cluster:monitor/nodes/info[n]] disconnected
    [2020-09-18T04:25:04,219][INFO ][o.e.c.r.a.AllocationService] [search1] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[my_index_4][0]] ...]).
    [2020-09-18T04:25:05,450][INFO ][o.e.m.j.JvmGcMonitorService] [search1] [gc][11099506] overhead, spent [605ms] collecting in the last [1.4s]
    [2020-09-18T04:25:05,453][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [search1] fatal error in thread [elasticsearch[search1][search][T#5]], exiting
    java.lang.OutOfMemoryError: Java heap space
        at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSource$GlobalOrdinalValuesSource.<init>(CompositeValuesSource.java:137) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSource.wrapGlobalOrdinals(CompositeValuesSource.java:123) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesComparator.<init>(CompositeValuesComparator.java:50) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregator.<init>(CompositeAggregator.java:69) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationFactory.createInternal(CompositeAggregationFactory.java:52) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:216) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.aggregations.AggregatorFactories.createTopLevelAggregators(AggregatorFactories.java:216) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:55) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:105) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$14(IndicesService.java:1133) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.indices.IndicesService$$Lambda$2241/341562582.accept(Unknown Source) ~[?:?]
        at org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$15(IndicesService.java:1186) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.indices.IndicesService$$Lambda$2242/1286052129.get(Unknown Source) ~[?:?]
        at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:160) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:143) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:412) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:116) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1192) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1132) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:305) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:340) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:316) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:312) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.search.SearchService$3.doRun(SearchService.java:1002) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]

Because of the exception above, I now get master_not_discovered_exception whenever I call any of the ES APIs.

Question: What steps should I take to bring Elasticsearch back to a normal state? Is there a way to restart the disconnected node?


Answer 1:


First, let me briefly explain what might have caused this issue:

  1. As the logs show, you seem to be running a costly aggregation. Aggregations are generally memory intensive, and here the memory they used could not be reclaimed by the garbage collector (GC), so the ES process eventually ran out of heap and exited.
  2. Apart from costly aggregations like the one in the logs, high memory consumption can also be caused by heavy search and indexing requests, so please check both the search and the indexing slow logs on this node (see the ES slow log documentation for more info).
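The JvmGcMonitorService line in the question's log ("spent [605ms] collecting in the last [1.4s]") is a quick way to see how hard the GC was struggling just before the crash. A minimal sketch that parses such a line and returns the fraction of wall-clock time spent in GC; it assumes the exact "overhead, spent [...] collecting in the last [...]" wording shown in the log and only the ms and s units, and the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the GC overhead from an Elasticsearch
// JvmGcMonitorService "overhead" log line.
class GcOverhead {
    // Assumption: durations are a number followed by "ms" or "s".
    private static final Pattern OVERHEAD = Pattern.compile(
            "spent \\[([0-9.]+)(ms|s)\\] collecting in the last \\[([0-9.]+)(ms|s)\\]");

    /** Converts a duration value with its unit ("ms" or "s") to milliseconds. */
    static double toMillis(String value, String unit) {
        double v = Double.parseDouble(value);
        return "s".equals(unit) ? v * 1000.0 : v;
    }

    /** Returns the fraction of time spent in GC, or -1 if the line does not match. */
    static double fraction(String logLine) {
        Matcher m = OVERHEAD.matcher(logLine);
        if (!m.find()) {
            return -1.0;
        }
        double gcMillis = toMillis(m.group(1), m.group(2));
        double windowMillis = toMillis(m.group(3), m.group(4));
        return gcMillis / windowMillis;
    }
}
```

For the line in the question this gives 605 / 1400, i.e. roughly 43% of the time spent collecting garbage instead of serving requests.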

Now for the resolution:

This ES node is dead, which is causing the master_not_discovered_exception, so it is important to restart this node and check whether the exception goes away. Refer to Opster's blog to read more about this exception.
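Why can a dead node block the whole cluster? On ES 6.x with Zen discovery, a master can only be elected when at least discovery.zen.minimum_master_nodes master-eligible nodes can reach each other, and the ES documentation recommends setting that value to a majority, (N / 2) + 1. A minimal sketch of that quorum arithmetic; it is a simplified model, not Elasticsearch code, and the node counts in the usage note are hypothetical because the question does not say how many master-eligible nodes the cluster has:

```java
// Simplified model of the Zen discovery quorum rule (ES 6.x); not Elasticsearch code.
class MasterQuorum {
    /** Recommended discovery.zen.minimum_master_nodes for N master-eligible nodes. */
    static int recommendedMinimumMasterNodes(int masterEligibleNodes) {
        if (masterEligibleNodes < 1) {
            throw new IllegalArgumentException("need at least one master-eligible node");
        }
        return masterEligibleNodes / 2 + 1; // strict majority
    }

    /** True if enough master-eligible nodes are reachable to elect a master. */
    static boolean canElectMaster(int reachableMasterEligible, int minimumMasterNodes) {
        return reachableMasterEligible >= minimumMasterNodes;
    }
}
```

For example, with three master-eligible nodes the quorum is 2: the cluster keeps a master when one node dies, but not when two are unreachable, and every API call then fails with master_not_discovered_exception until enough nodes are back.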

Preventing OOM exceptions

  1. Properly configure the circuit breakers available in ES and, if possible, upgrade to ES 7.x, whose circuit breakers can take real memory usage into account.
  2. Improve ES indexing and search performance.
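The idea behind a circuit breaker is to estimate how much memory a request will need and reject it before allocating, instead of letting the JVM die. A deliberately simplified sketch of that idea; the class is hypothetical and is not Elasticsearch's implementation (in ES itself the limits are set through settings such as indices.breaker.request.limit and indices.breaker.total.limit, and ES 7.x adds indices.breaker.total.use_real_memory):

```java
// Simplified illustration of a memory circuit breaker; NOT Elasticsearch's
// CircuitBreaker implementation, just the idea behind it.
class SimpleBreaker {
    private final long limitBytes;
    private long usedBytes;

    SimpleBreaker(long limitBytes) {
        if (limitBytes < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limitBytes = limitBytes;
    }

    /** Accounts for the given bytes, or trips (throws) without changing the usage. */
    synchronized void reserve(long bytes, String label) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be >= 0");
        }
        // usedBytes <= limitBytes always holds, so this subtraction cannot overflow.
        if (bytes > limitBytes - usedBytes) {
            throw new IllegalStateException("[" + label + "] would need " + bytes
                    + " bytes; only " + (limitBytes - usedBytes) + " left under the limit");
        }
        usedBytes += bytes;
    }

    /** Returns previously reserved bytes to the breaker. */
    synchronized void release(long bytes) {
        usedBytes = Math.max(0L, usedBytes - bytes);
    }

    synchronized long used() {
        return usedBytes;
    }

    /** Checks the estimate (8 bytes per long) before allocating a long[size] buffer. */
    long[] allocateLongs(int size, String label) {
        reserve((long) size * Long.BYTES, label);
        return new long[size]; // caller is responsible for calling release() later
    }
}
```

With a 1 GiB limit, a request for long[Integer.MAX_VALUE] trips the breaker with an exception that can be returned to the client, and the node stays up.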



Answer 2:


The java.lang.OutOfMemoryError: Java heap space was caused by running a composite aggregation query in which I set the size parameter to Integer.MAX_VALUE (2147483647):

    {
        "size": 0,
        "aggregations": {
            "myParam.keyword": {
                "composite": {
                    "size": 2147483647,
                    "sources": [
                        {
                            "myParam.keyword": {
                                "terms": {
                                    "field": "myParam.keyword",
                                    "order": "asc"
                                }
                            }
                        }
                    ]
                }
            }
        }
    }
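Instead of asking for every bucket in one response, the composite aggregation is meant to be paged: request a bounded size, then pass the key of the last bucket from the previous page as after (newer versions also return it in the response as after_key), and repeat until a page comes back with no buckets. A minimal sketch that builds such request bodies as JSON strings; the page size of 1000 and the class name are assumptions, and the JSON escaping only handles backslashes and double quotes in simple string keys:

```java
// Hypothetical helper that builds composite-aggregation request bodies page by page.
class CompositePager {
    // Assumption: a modest page size instead of Integer.MAX_VALUE; tune for your data.
    static final int PAGE_SIZE = 1000;

    // Minimal JSON string quoting (escapes backslash and double quote only).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /**
     * Builds the body for one page.
     * afterKey is the "myParam.keyword" value of the last bucket of the previous
     * page, or null for the first page.
     */
    static String pageRequest(String afterKey) {
        String field = "myParam.keyword";
        StringBuilder sb = new StringBuilder();
        sb.append("{\"size\":0,\"aggregations\":{").append(quote(field))
          .append(":{\"composite\":{\"size\":").append(PAGE_SIZE)
          .append(",\"sources\":[{").append(quote(field))
          .append(":{\"terms\":{\"field\":").append(quote(field))
          .append(",\"order\":\"asc\"}}}]");
        if (afterKey != null) {
            sb.append(",\"after\":{").append(quote(field)).append(":")
              .append(quote(afterKey)).append("}");
        }
        sb.append("}}}}");
        return sb.toString();
    }
}
```

Each page then needs only PAGE_SIZE slots per source, so the memory reserved up front stays small no matter how many distinct values myParam.keyword has.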

According to the stack trace, the error occurred while initializing the aggregation values array (CompositeValuesSource.java:137):

    GlobalOrdinalValuesSource(ValuesSource.Bytes.WithOrdinals vs, int size, int reverseMul) {
        super(vs, size, reverseMul);
        // one long per requested bucket, allocated eagerly when the aggregator is created
        this.values = new long[size];
    }

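Here, the size parameter comes directly from the query, so a size of 2147483647 makes the constructor allocate a long array with that many elements (about 16 GiB) before a single document has been collected.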
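The arithmetic behind the failed allocation is easy to check: each long takes 8 bytes, so the array needs size × 8 bytes, ignoring the small array header. A back-of-the-envelope sketch; the class name is hypothetical and the heap size is a parameter you supply, not something read from the ES node:

```java
// Back-of-the-envelope size of a long[size] buffer such as the one allocated
// by GlobalOrdinalValuesSource for a given composite "size".
class CompositeBufferEstimate {
    /** Bytes needed for the array payload (8 bytes per long, header ignored). */
    static long bytesForLongArray(long size) {
        return Math.multiplyExact(size, (long) Long.BYTES);
    }

    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    /** True if the array payload alone is smaller than the given heap size. */
    static boolean fitsInHeap(long size, long maxHeapBytes) {
        return bytesForLongArray(size) < maxHeapBytes;
    }
}
```

For size = 2147483647 this is 17,179,869,176 bytes, just under 16 GiB, requested in one go on top of everything the heap already holds; a page size of 1000 needs only 8,000 bytes.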

The answer https://stackoverflow.com/a/63965634/5284890 confirms the root cause.

My next step was to stop and start Elasticsearch again using the following commands:

    sudo systemctl stop elasticsearch.service
    sudo systemctl start elasticsearch.service
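After the restart, the cluster health API shows whether a master has been elected again. A minimal sketch of such a check, assuming the node is reachable over plain HTTP without authentication (for example at http://localhost:9200); the class and method names are hypothetical, while wait_for_status and timeout are standard _cluster/health parameters:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical post-restart check against the cluster health API.
class ClusterHealthCheck {
    // Matches a quoted color; a numeric "status":503 in an error body does not match.
    private static final Pattern STATUS =
            Pattern.compile("\"status\"\\s*:\\s*\"(green|yellow|red)\"");

    // Assumption: baseUrl like "http://localhost:9200", no TLS or auth.
    static URI healthUri(String baseUrl, String waitForStatus, int timeoutSeconds) {
        return URI.create(baseUrl + "/_cluster/health?wait_for_status=" + waitForStatus
                + "&timeout=" + timeoutSeconds + "s");
    }

    /** Returns "green", "yellow", "red", or "unknown" (e.g. for a 503 error body). */
    static String parseStatus(String body) {
        Matcher m = STATUS.matcher(body);
        return m.find() ? m.group(1) : "unknown";
    }

    /** Performs the GET; error bodies (4xx/5xx) are returned instead of thrown. */
    static String fetch(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("GET");
        try {
            int code = conn.getResponseCode();
            InputStream in = code >= 400 ? conn.getErrorStream() : conn.getInputStream();
            if (in == null) {
                return "";
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                return reader.lines().collect(Collectors.joining("\n"));
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling parseStatus(fetch(healthUri("http://localhost:9200", "yellow", 60))) after systemctl start should report yellow or green once the node has rejoined: yellow means all primaries are assigned while some replicas are still recovering, and "unknown" typically means the master_not_discovered_exception error body came back instead.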

My next steps will be to look into the circuit breaker settings suggested in the ES article mentioned in this answer: https://stackoverflow.com/a/63965634/5284890.



Source: https://stackoverflow.com/questions/63953220/elasticsearch-restart-node-after-java-lang-outofmemoryerror-java-heap-space
