Hive: selecting the latest value based on timestamp

孤独总比滥情好 2021-01-03 15:17

I have a table with the following columns:

C1,C2,ProcessTimeStamp,InsertDateTimeStamp
p1,v1,2014-01-30 12:15:23,2013-10-01 05:34:23 
p1,v2,2014-01-31 05:1         


        
2 answers
  • 2021-01-03 16:08

    You should strongly consider upgrading Hive. This is easy to do with the windowing functions available in Hive 0.11 and later: compute row_number() over (partition by C1 order by ProcessTimeStamp desc) in a sub-select, then keep only the first row per C1 in an outer select.

    You don't need to update your entire cluster to upgrade Hive, you can just deploy it to one node.
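
    The window-function approach described above could look roughly like the sketch below. It assumes the table is called my_table (the name used in the other answer) and that ProcessTimeStamp is stored as a zero-padded 'yyyy-MM-dd HH:mm:ss' string, so that plain ordering on the column matches chronological order. The alias rn is only illustrative.

    ```sql
    -- Sketch only: my_table and the alias rn are assumed names.
    -- Number the rows within each C1 group, newest ProcessTimeStamp first,
    -- then keep only row 1 of each group.
    select C1, C2, ProcessTimeStamp, InsertDateTimeStamp
    from (
      select C1, C2, ProcessTimeStamp, InsertDateTimeStamp,
             row_number() over (partition by C1 order by ProcessTimeStamp desc) as rn
      from my_table
    ) t
    where rn = 1;
    ```

    If two rows for the same C1 share the same ProcessTimeStamp, row_number() keeps one of them arbitrarily. Adding a second column to the order by (for example InsertDateTimeStamp desc) makes the choice deterministic.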

  • 2021-01-03 16:13
    select C1, s.C2, s.ProcessTimeStamp, s.InsertDateTimeStamp from (
      select C1, max(named_struct('unixtime', unix_timestamp(ProcessTimeStamp, 'yyyy-MM-dd HH:mm:ss'), 'C2', C2, 'ProcessTimeStamp', ProcessTimeStamp, 'InsertDateTimeStamp', InsertDateTimeStamp)) as s
      from my_table group by C1
    ) t;
    

    Taking the max of a struct compares by the first field, then the second field, and so on. So if you pack the whole row into a struct with the parsed timestamp as the first field, the max for each C1 is the struct representing the row with the latest timestamp. Then just unpack it by selecting the individual fields.
