More efficient query to avoid OutOfMemoryError in Hive

问题

I'm getting a java.lang.OutOfMemoryError: GC overhead limit exceeded in Hive. In searching I've found that is because 98% of all CPU time of the process is going to garbage collect (whatever that means?). Is the core of my issue in my query? Should I be writing the below in a different way to avoid this kind of problem?

I'm trying to count how many of a certain phone type have an active 'Use' in a given time period. Is there a way to do this logic differently, that would run better?

select count(a.imei)
from
(Select distinct imei
from pingdata
where timestamp between TO_DATE("2016-06-01") AND TO_DATE("2016-07-17")
and ((SUBSTR(imei,12,2) = "04") or (SUBSTR(imei,12,2) = "05")) ) a
join
(SELECT distinct imei
FROM eventdata
where timestamp between TO_DATE("2016-06-01") AND TO_DATE("2016-07-17")
AND event = "Use" AND clientversion like '3.2%') b
on a.imei=b.imei

Thank you

回答1:

Applying distinct to each dataset before joining them is safer because joining not unique keys will duplicate data.

I would recommend to partition your datasets by to_date(timestamp) field (yyyy-MM-dd) to make partition pruning work according to your where clause (check it works). Partition also by event field if datasets are too big and contain a lot of data where event <> 'Use'.

It's important to know on which stage it fails. Study the exception as well. If it fails on mappers then you should optimize your subqueries (add partitions as I mentioned). if it fails on reducer (join) then you should somehow improve join (try to reduce bytes per reducer:

set hive.exec.reducers.bytes.per.reducer=67108864; or even less) if it fails on writer (OrcWriter then try to add partition to Output table by substr from imei and 'distribute by substr(imei...)` at the end of query to reduce pressure on reducers).

Or add une more column with low cardinality and even distribution to distribute the data between more reducers evenly:

distribute by substr(imei...), col2

Make sure that partition column is in the distribute by. This will reduce the number of files written by each reducer and help to get rid of OOM

回答2:

In order to improve performance, by looking at your query: I would partition the hive tables by yyyy, mm, dd, or by first two digits of imei, you will have to decide the variable according to your need of querying these tables and amount of data. but I would vote for yyyy, mm, dd, that will give you tremendous amount of improvement on performance. see improving-query-performance-using-partitioning

But for now, this should give you some improvements:

Select count(distinct(pd.imei))
from pingdata pd join eventdata ed on pd.imei=ed.imei
where 
TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
and pd.timestamp=ed.pd.timestamp
and SUBSTR(pd.imei,12,2) in ('04','05') 
and ed.event = 'Use' AND ed.clientversion like '3.2%';

if TO_DATE(timestamp) values are inserted on same day, in other words if both values are same for date than and pd.timestamp=ed.pd.timestamp condition should be excluded.

Select count(distinct(pd.imei))
from pingdata pd join eventdata ed on pd.imei=ed.imei
where 
TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
and SUBSTR(pd.imei,12,2) in ('04','05') 
and ed.event = 'Use' AND ed.clientversion like '3.2%';

Try running both queries and compare results. Do let us know the differences and if you find this helpful.

来源：https://stackoverflow.com/questions/38468802/more-efficient-query-to-avoid-outofmemoryerror-in-hive

标签

performance

Hadoop

Hive

out-of-memory

memory-efficient