I have a Hive table activity with columns userid, itemid, and rating, with possible ratings of 1 and 0, in which there ar
If you know in advance that negatives are the limiting factor, you can get their exact count with the first query (call it N). You can then draw the entire sample with the query below, hardcoding N where it appears:
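select * from
(
  -- N random positives; the order by/limit must sit in its own subquery,
  -- otherwise it would apply to the whole union rather than to this branch
  select * from
  (
    select * from activity where rating = 1 order by rand() limit N
  ) pos_sample
  union all
  -- all N negatives
  select * from activity where rating = 0
) all_sample
order by rand() limit 2N  -- hardcode the value of 2*N here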
The last order by may not be necessary, depending on your needs: the union already contains 2N rows, so it only shuffles them.
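
To see the two-step approach end to end, here is a minimal sketch that runs against an in-memory SQLite database instead of Hive. Everything outside the original query is an assumption made for illustration: the helper names balanced_sample and make_demo_db, the toy data, and the use of SQLite's random() in place of Hive's rand(). It counts the negatives first, then passes that count as the limit for the positives.

```python
import sqlite3


def balanced_sample(conn):
    """Return (n, rows): n random positives plus all n negatives.

    Assumes negatives (rating = 0) are the limiting class, as in the answer above.
    """
    cur = conn.cursor()
    # Step 1: count the limiting class to obtain N.
    n = cur.execute(
        "SELECT COUNT(*) FROM activity WHERE rating = 0"
    ).fetchone()[0]
    # Step 2: N random positives (limited inside a subquery) plus every negative.
    rows = cur.execute(
        """
        SELECT * FROM (
            SELECT * FROM activity WHERE rating = 1 ORDER BY random() LIMIT ?
        )
        UNION ALL
        SELECT * FROM activity WHERE rating = 0
        """,
        (n,),
    ).fetchall()
    return n, rows


def make_demo_db():
    """Build a hypothetical activity table: 50 positives and 10 negatives."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activity (userid INTEGER, itemid INTEGER, rating INTEGER)"
    )
    data = [(u, u % 7, 1) for u in range(50)]
    data += [(u, u % 5, 0) for u in range(50, 60)]
    conn.executemany("INSERT INTO activity VALUES (?, ?, ?)", data)
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    n, sample = balanced_sample(conn)
    print(n, len(sample))
```

Unlike Hive, SQLite accepts a bound parameter for the limit, so N does not need to be hardcoded here. In Hive you would still paste the counted value into the query text.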