hive sql find the latest record

别那么骄傲 2021-01-30 17:28

the table is:

create table test (
id string,
name string,
age string,
modified string)

The data looks like this:

id    name   age  modified


        
8 answers
  •  有刺的猬
    2021-01-30 18:17

    Suppose the data looks like this:

        id      name    age     modified
        1       a       10      2011-11-11 11:11:11
        1       a       11      2012-11-11 12:00:00
        2       b       23      2012-12-10 10:11:12
        2       b       21      2012-12-10 10:11:12
        2       b       22      2012-12-15 10:11:12
        2       b       20      2012-12-15 10:11:12
    

    The above query then gives the following result (notice the repeated 2, b rows, which have the same date and time):

        1       a       11      2012-11-11 12:00:00
        2       b       22      2012-12-15 10:11:12
        2       b       20      2012-12-15 10:11:12
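
    The query being referred to is not reproduced in this answer. As a rough sketch, assuming it is the usual self-join against the per-id maximum of modified, a query of the following shape would return those three rows for the sample data, because both rows tied at 2012-12-15 10:11:12 satisfy the join condition:

    ```sql
    -- Assumed shape of the query under discussion (not taken from the thread).
    -- Subquery a finds the latest modified value per id; the join keeps
    -- every row of test that carries that value, so ties yield duplicates.
    select b.id, b.name, b.age, b.modified
    from
        (select id, max(modified) as modified from test group by id) a
      join
        test b
      on
        (a.id = b.id and a.modified = b.modified);
    ```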
    

    This query runs an additional group by and is less efficient but gives the correct result -

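        -- a: latest modified value per id; b: the rows of test carrying that value.
        -- Grouping by b.id as well as b.modified keeps two different ids that
        -- happen to share a timestamp from being merged into one row.
        select b.id, collect_set(b.name)[0], collect_set(b.age)[0], b.modified
        from
            (select id, max(modified) as modified from test group by id) a
          left outer join
            test b
          on
            (a.id=b.id and a.modified=b.modified)
        group by
          b.id, b.modified;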
    

    The above query then gives:

        1       a       11      2012-11-11 12:00:00
        2       b       20      2012-12-15 10:11:12
    

    Now, if we improve the query a little, it runs only one MapReduce job instead of three and still returns one row per id:

        select id, collect_set(name)[0], collect_set(age)[0], max(modified)
        from test 
        group by id;
    

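    Note - this will slow down if your group by field produces large results. Also, collect_set(...)[0] picks an arbitrary value from each group, so name and age are not guaranteed to come from the row with the latest modified value.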
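
    If your Hive version supports windowing functions (Hive 0.11 or later), row_number() is another way to keep exactly one row per id while taking name and age from the same row as the latest modified value. A minimal sketch, where the alias rn and the tie-break on age are my own choices for illustration:

    ```sql
    select id, name, age, modified
    from (
        select id, name, age, modified,
               -- number the rows within each id, newest first;
               -- age desc only decides between rows with the same timestamp
               row_number() over (partition by id order by modified desc, age desc) as rn
        from test
    ) t
    where rn = 1;
    ```

    Since modified and age are declared as strings, both orderings compare text. That works for timestamps in yyyy-MM-dd HH:mm:ss form, but the age tie-break would need a cast if the ages can have different numbers of digits.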
