hive sql find the latest record

后端 未结 8 1989
别那么骄傲
别那么骄傲 2021-01-30 17:28

the table is:

create table test (
id string,
name string,
age string,
modified string)

data like this:

id    name   age  modife         


        
相关标签:
8条回答
  • 2021-01-30 18:04

    Give this a try:

    select t1.* from test t1
    join (
      select id, max(modifed) maxModified from test
      group by id
    ) s
    on t1.id = s.id and t1.modifed = s.maxModified
    

    Fiddle here.

    Left outer join solution here.

    Let us know which one runs faster :)

    0 讨论(0)
  • 2021-01-30 18:07

    There's a nearly undocumented feature of Hive SQL (I found it in one of their Jira bug reports) that lets you do something like argmax() using struct()s. For example if you have a table like:

    test_argmax
    id,val,key
    1,1,A
    1,2,B
    1,3,C
    1,2,D
    2,1,E
    2,1,U
    2,2,V
    2,3,W
    2,2,X
    2,1,Y
    

    You can do this:

    select 
      max(struct(val, key, id)).col1 as max_val,
      max(struct(val, key, id)).col2 as max_key,
      max(struct(val, key, id)).col3 as max_id
    from test_argmax
    group by id
    

    and get the result:

    max_val,max_key,max_id
    3,C,1
    3,W,2
    

    I think in case of ties on val (the first struct element) it will fall back to comparison on the second column. I also haven't figured out whether there's a neater syntax for getting the individual columns back out of the resulting struct, maybe using named_struct somehow?

    0 讨论(0)
  • 2021-01-30 18:07

    Just slightly different approach than what has been answered in previous answer.

    Below example uses hive windowing function to find out the latest record, read more here

    SELECT t.id
        ,t.name
        ,t.age
        ,t.modified
    FROM (
        SELECT id
            ,name
            ,age
            ,modified
            ,ROW_NUMBER() OVER (
                PARTITION BY id ORDER BY unix_timestamp(modified,'yyyy-MM-dd hh:mm:ss') DESC
                ) AS ROW_NUMBER   
        FROM test
        ) t
    WHERE t.ROW_NUMBER <= 1;
    

    The modified is string so converting it to timestamp using unix_timestamp(modified,'yyyy-MM-dd hh:mm:ss') then applying order by on timestamp.

    0 讨论(0)
  • 2021-01-30 18:10

    There is a relatively recent feature of Hive SQL, analytic functions and the over clause. This should do the job without joins

    select id, name, age, last_modified 
    from ( select id, name, age, modified, 
                  max( modified) over (partition by id) as last_modified 
           from test ) as sub
    where   modified = last_modified 
    

    What's going on here is that the subquery produces a new row with an extra column last_modified which has the latest modified timestamp for the corresponding person's id. (Similar to what group by would do) The key here is that the subquery gets you again one row per row in your original table and then you filter from that.

    There is a chance that even the simpler solution works:

    select  id, name, age,  
            max( modified) over (partition by id) last_modified 
    from test 
    where   modified = last_modified 
    

    By the way, the same code would work in Impala, too.

    0 讨论(0)
  • 2021-01-30 18:15

    try this

    select id,name,age,modified from test
     where modified=max(modified)
     group by id,name
    
    0 讨论(0)
  • 2021-01-30 18:15

    If u can make sure that row which has max modified also has max age in same id row set.

    Try

    select id, name, max(age), max(modified) 
    from test
    group by id, name
    
    0 讨论(0)
提交回复
热议问题