Matched-size random samples from a Hive table

情深已故 2021-01-15 04:19

I have a hive table activity with columns userid, itemid, and rating, with possible ratings of 1 and 0, in which there ar

3 answers
  • 2021-01-15 04:46

    If you know in advance that negatives are the limiting factor, you can get their exact count with the first query (call it N). You can then draw the entire sample with the query below, hardcoding N:

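    select * from
    (
      -- Hive expects ORDER BY/LIMIT on a single UNION ALL branch to sit in its own subquery
      select * from (
        select * from activity where rating=1 order by rand() limit N
      ) pos
      union all
      select * from activity where rating=0
    ) all_sample
    order by rand()  -- no outer limit needed: with N negatives the union already has 2N rows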
    

    The last ORDER BY may not be necessary, depending on your needs.
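
    To check which class is actually the limiting one, and to read off N before hardcoding it, a per-class count is enough. This is only a sketch against the activity table described in the question; the alias cnt is made up for illustration:

    ```sql
    -- Row count per rating value; the smaller count is the N to hardcode.
    -- Assumes rating only takes the values 0 and 1, as stated in the question.
    select rating, count(*) as cnt
    from activity
    group by rating;
    ```

    If the rating = 1 row turns out to be the smaller one, the roles of the two branches in the sampling query would need to be swapped.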

  • 2021-01-15 05:01

    If there are a lot of classes, you can use the following query to get samples across all the classes without writing the query multiple times:

    select * from
        (select userid, itemid, rating,
        row_number() over(partition by rating order by rand()) as rn
        from activity
        ) a
    where rn <= x
    

    x is the number of rows you want from each class.
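
    If you would rather not hardcode x, one possible variant computes it as the size of the smallest class and joins it back in, so every class is cut to the same length. Treat this as a sketch rather than a tested recipe: the aliases c, m and cnt are made up here, and some Hive configurations restrict cartesian products (for example through hive.strict.checks.cartesian.product), in which case the cross join has to be allowed explicitly.

    ```sql
    select a.userid, a.itemid, a.rating
    from (
      -- Number the rows of each class in random order.
      select userid, itemid, rating,
             row_number() over (partition by rating order by rand()) as rn
      from activity
    ) a
    cross join (
      -- Single-row subquery: x = size of the smallest class.
      select min(cnt) as x
      from (
        select rating, count(*) as cnt
        from activity
        group by rating
      ) c
    ) m
    where a.rn <= m.x;
    ```

    Because m holds exactly one row, the cross join only attaches x to every row of a; it does not multiply the data.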

  • 2021-01-15 05:08

    As suggested here (http://www.joefkelley.com/736/), rand() will only randomize within a single reducer. If your data is skewed across reducers (e.g., if you don't distribute by the key you are randomizing), you may see skewed results.

    First be sure you are using DISTRIBUTE BY your key, then use rand() with limit to return N values.
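
    A commonly used shape for this in Hive is sketched below. Note the assumption: rows are distributed by rand() rather than by a particular column, which this answer does not spell out, and N is a placeholder to be replaced with an integer literal.

    ```sql
    -- DISTRIBUTE BY rand() spreads rows randomly across reducers,
    -- SORT BY rand() shuffles the rows within each reducer,
    -- LIMIT N then keeps N rows of the shuffled output.
    select *
    from activity
    where rating = 1
    distribute by rand()
    sort by rand()
    limit N;
    ```

    Unlike ORDER BY rand(), this does not force a total ordering of the whole table, which is why it tends to scale better on large inputs.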
