Hive partition pruning on computed column

巧了我就是萌 提交于 2019-12-23 17:18:55

问题


I have a few tables on Hive and my query is trying to retrieve the data for the past x days. Hive is pruning the partitions when I use a direct date, but is doing a full table scan when using a formula instead.

select *
from   f_event
where  date_key > 20160101;

scanned partitions..

s3://...key=20160102 [f]
s3://...key=20160103 [f]
s3://...key=20160104 [f]

If I use a formula, say, to get the past 4 weeks of data

Select count(*)
From    f_event f
Where  date_key  > from_unixtime(unix_timestamp()-2*7*60*60*24, 'yyyyMMdd')

This is scanning all partitions in the table.

environment : Hadoop 2.6.0, EMR, Hive on S3, Hive 1.0.0


回答1:


Hive doesn't trigger partition pruning when the filtering expression contains non-deterministic functions such as unix_timestamp().

A good reason for this was mentioned in the discussion:

Imagine a situation where you had:

WHERE partition_column = f(unix_timestamp()) AND ordinary_column = f(unix_timestamp).

The right hand side of the predicate has to be evaluated at map-time, whereas you're assuming that left hand side should be evaluated at compile time, which means you have two different values of unix_timestamp() floating around, which can only end badly.



来源:https://stackoverflow.com/questions/35782664/hive-partition-pruning-on-computed-column

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!