问题
I have a hive table with below structure
ID string,
Value string,
year int,
month int,
day int,
hour int,
minute int
This table is refreshed every 15 mins and it is partitioned with year/month/day/hour/minute columns. Please find below samples on partitions.
year=2019/month=12/day=29/hour=19/minute=15
year=2019/month=12/day=30/hour=00/minute=45
year=2019/month=12/day=30/hour=08/minute=45
year=2019/month=12/day=30/hour=09/minute=30
year=2019/month=12/day=30/hour=09/minute=45
I want to select only latest partition data from the table. I tried to use max() statements with those partition columns, but its not very efficient as data size is huge. Please let me know, how can i get the data in a convenient way using hive sql.
回答1:
If the latest partition is always in current date, then you can filter current date partition and use rank() to find records with latest hour, minute:
select * --list columns here
from
(
select s.*, rank() over(order by hour desc, minute desc) rnk
from your_table s
where s.year=year(current_date) --filter current day (better pass variables calculated if possible)
and s.month=lpad(month(current_date),2,0)
and s.day=lpad(day(current_date),2,0)
-- and s.hour=lpad(hour(current_timestamp),2,0) --consider also adding this
) s
where rnk=1 --latest hour, minute
And if the latest partition is not necessarily equals current_date then you can use rank() over (order by s.year desc, s.month desc, s.day desc, hour desc, minute desc)
, without filter on date this will scan all the table and is not efficient.
It will perform the best if you can calculate partition filters in the shell and pass as parameters. See comments in the code.
来源:https://stackoverflow.com/questions/59674580/get-latest-data-from-hive-table-with-multiple-partition-columns