Hive getting top n records in group by query

后端 未结 6 1959
终归单人心
终归单人心 2020-12-07 17:47

I have following table in hive

user-id, user-name, user-address,clicks,impressions,page-id,page-name

I need to find out top 5 users[user-id,user-name,user-ad

相关标签:
6条回答
  • 2020-12-07 18:34

    As of Hive 0.11, you can do this using Hive's built in rank() function and using simpler semantics using Hive's built-in Analytics and Windowing functions. Sadly, I couldn't find as many examples with these as I would have liked, but they are really, really useful. Using those, both rank() and WhereWithRankCond are built in, so you can just do:

    SELECT page-id, user-id, clicks
    FROM (
        SELECT page-id, user-id, rank() 
               over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
        FROM my table
    ) ranked_mytable
    WHERE ranked_mytable.rank < 5
    ORDER BY page-id, rank
    

    No UDF required, and only one subquery! Also, all of the rank logic is localized.

    You can find some more (though not enough for my liking) examples of these functions in this Jira and on this guy's blog.

    0 讨论(0)
  • 2020-12-07 18:42

    You can do it with a rank() UDF described here: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

    SELECT page-id, user-id, clicks
    FROM (
        SELECT page-id, user-id, rank(user-id) as rank, clicks
        FROM mytable
        DISTRIBUTE BY page-id, user-id
        SORT BY page-id, user-id, clicks desc
    ) a 
    WHERE rank < 5
    ORDER BY page-id, rank
    
    0 讨论(0)
  • 2020-12-07 18:43

    You can use each_top_k function of hivemall for an efficient top-k computation on Apache Hive.

    select
      page-id, 
      user-id,
      clicks
    from (
      select
        each_top_k(5, page-id, clicks, page-id, user-id)
          as (rank, clicks, page-id, user-id)
      from (
        select
          page-id, user-id, clicks
        from
          mytable
        DISTRIBUTE BY page-id SORT BY page-id
      ) t1
    ) t2
    order by page-id ASC, clicks DESC
    

    The each_top_k UDTF is very fast when compared to other methods running top-k queries (e.g., distributed by/rank) in Hive because it does not hold the whole ranking for the intermediate result.

    0 讨论(0)
  • 2020-12-07 18:43

    select * from (select user_id,user-name,user-address,page,click,row_num() over (partition by page order by clicks desc) a where a.row_num<=5

    there might be a change in choosing the columns but the logic is correct.

    0 讨论(0)
  • 2020-12-07 18:49

    Let us say your data looks like following :

    page-id   user-id   clicks
    page1     user1     10
    page1     user2     10
    page1     user3     9
    page1     user4     8
    page1     user5     7
    page1     user6     7
    page1     user7     6
    page1     user8     5
    page2     user1     20
    page2     user2     19
    page2     user3     18
    

    Below Query will give you :

    SELECT page-id, user-id, clicks, rank
    FROM (
        SELECT page-id, user-id, rank() 
               over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
        FROM your_table
    ) ranked_table
    WHERE ranked_table.rank <= 5
    

    Result :

    page-id   user-id   clicks  rank
    page1     user1     10      1
    page1     user2     10      1 
    page1     user3     9       3 
    page1     user4     8       4
    page1     user5     7       5 
    page1     user6     7       5 
    page2     user1     20      1
    page2     user2     19      2  
    page2     user3     18      3
    

    So, for page1 you are getting 6 users, as users with same number of clicks are ranked same.

    But, if you are looking for exactly 5 users, and pick randomly in case multiple users fall in same rank. You can use the below query

    SELECT page-id, user-id, clicks, rank
    FROM (
        SELECT page-id, user-id, row_number() 
               over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
        FROM your_table
    ) ranked_table
    WHERE ranked_table.rank <= 5
    

    Result :

    page-id   user-id   clicks  rank
    page1     user1     10      1
    page1     user2     10      2 
    page1     user3     9       3 
    page1     user4     8       4
    page1     user5     7       5 
    page2     user1     20      1
    page2     user2     19      2  
    page2     user3     18      3
    
    0 讨论(0)
  • 2020-12-07 18:51

    Revised answer, fixing the bug as mentioned by @Himanshu Gahlot

    SELECT page-id, user-id, clicks
    FROM (
        SELECT page-id, user-id, rank(page-id) as rank, clicks FROM (
            SELECT page-id, user-id, clicks FROM mytable
            DISTRIBUTE BY page-id
            SORT BY page-id, clicks desc
    ) a ) b
    WHERE rank < 5
    ORDER BY page-id, rank
    

    Note that the rank() UDAF is applied to the page-id column, whose new value is used to reset or increase the rank counter (e.g. reset counter for each page-id partition)

    0 讨论(0)
提交回复
热议问题