Optimize Hive Query. java.lang.OutOfMemoryError: Java heap space/GC overhead limit exceeded

我怕爱的太早我们不能终老 提交于 2021-01-28 14:18:32

问题


How can I optimize a query of this form since I keep running into this OOM error? Or come up with a better execution plan? If I removed the substring clause, the query would work fine, suggesting that this takes a lot of memory.

When the job fails, the beeline output shows the OOM Java heap space. Readings online suggested that I increase export HADOOP_HEAPSIZE but this still results in the error. Another thing I tried was increasing the hive.tez.container.size and hive.tez.java.opts (tez heap), but still has this error. In the YARN logs, there would be GC overhead limit exceeded, suggesting a combination of not enough memory and/or the query plan is extremely inefficient since it can't collect back enough memory.

I am using Azure HDInsight Interactive Query 4.0. 20 worker node, D13v2 8 core, and 56GB RAM.

Source table

create external table database.sourcetable(
  a,
  b,
  c,
  ...
  (183 total columns)
  ...
)
PARTITIONED BY ( 
  W string, 
  X int, 
  Y string, 
  Z int
)

Target Table

create external table database.NEWTABLE(
  a,
  b,
  c,
  ...
  (187 total columns)
  ...
  W,
  X,
  Y,
  Z
)
PARTITIONED BY (
  aAAA,
  bBBB
)

Query

insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC)
select
a,
b,
c,
...
(187 total columns)
...
W,
X,
Y,
Z,
cast(a as string) as aAAA, 
from_unixtime(unix_timestamp(b,'yyMMdd'),'yyyyMMdd') as bBBB,
substring(upper(c),1,2) as cCCC
from database.sourcetable

回答1:


If everything else is okay, try to add distribute by partiton key at the end of your query:

  from database.sourcetable 
  distribute by aAAA, bBBB, cCCC

As a result each reducer will create only one partition file, consuming less memory




回答2:


Try sorting the partitioned columns:

SET hive.optimize.sort.dynamic.partition=true;

When enabled, dynamic partitioning column will be globally sorted. This way we can keep only one record writer open for each partition value in the reducer thereby reducing the memory pressure on reducers.

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties



来源:https://stackoverflow.com/questions/62804436/optimize-hive-query-java-lang-outofmemoryerror-java-heap-space-gc-overhead-lim

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!