I have a Hive table activity with columns userid, itemid, and rating, with possible ratings of 1 and 0, in which there ar
If you know in advance that negatives are the limiting factor, you can get their exact count with the first query (call it N). Then you can draw the entire sample with the query below, hardcoding N:
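select * from
(
  -- Hive needs ORDER BY / LIMIT on a single UNION branch wrapped in a subquery
  select * from
    (select * from activity where rating=1 order by rand() limit N) pos
  union all
  select * from activity where rating=0
) all_sample
order by rand() limit 2N  -- hardcode twice the value of N here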
The final ORDER BY rand() may not be necessary, depending on your needs; it only shuffles the combined sample.
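The counting step referred to above is not shown here, so as a minimal sketch (assuming the table and column names from the question), the per-class counts could be obtained like this. N would then be the count of the smaller class:

```sql
-- Count rows per rating; the smaller count is the limiting factor N.
select rating, count(*) as cnt
from activity
group by rating;
```

The alias cnt is only illustrative. The value read off this result is what gets hardcoded as N in the sampling query.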
If there are a lot of classes, you can use the following query to get samples across all the classes without writing the query multiple times:
select * from
(select userid, itemid, rating,
row_number() over(partition by rating order by rand()) as rn
from activity
) a
where rn <= x
x is the number of rows you want from each class.
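If you would rather not hardcode x, one possible variant computes it as the size of the smallest class and joins it back in. This is only a sketch: the aliases c and m and the column names cnt and x are made up for illustration, and the cross join may be rejected if your Hive configuration runs strict checks against Cartesian products.

```sql
-- Balanced sample: every class is capped at the size of the smallest class.
select a.userid, a.itemid, a.rating
from (
  -- Number the rows of each class in random order.
  select userid, itemid, rating,
         row_number() over (partition by rating order by rand()) as rn
  from activity
) a
cross join (
  -- Single-row subquery holding the smallest class count.
  select min(cnt) as x
  from (select rating, count(*) as cnt from activity group by rating) c
) m
where a.rn <= m.x;
```

Because m has exactly one row, the cross join just attaches the cap x to every row of a. The price is an extra pass over activity to compute the counts.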
As suggested here (http://www.joefkelley.com/736/), rand() only randomizes within a single reducer. If your data is skewed across reducers (e.g., if you don't distribute by the key you are randomizing on), your sample may be skewed as well.
First make sure you DISTRIBUTE BY your key, then use rand() with LIMIT to return N values.
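A minimal sketch of that pattern, assuming the "key" is rand() itself (my reading of the advice, not something stated above) and using 1000 as a placeholder for N:

```sql
-- Spread rows randomly across reducers, sort randomly within each reducer,
-- then keep 1000 rows (placeholder for N).
select *
from activity
where rating = 1
distribute by rand()
sort by rand()
limit 1000;
```

Unlike a global ORDER BY rand(), which funnels every row through one reducer, DISTRIBUTE BY plus SORT BY lets the shuffle run in parallel before the LIMIT is applied.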