问题
I have a large table containing a userID
column and other user variable columns, and I would like to use Hive to extract a random sample of users based on their userID
. Furthermore, sometimes these users will be on multiple rows and if a randomly selected userID
is contained in other parts of the table I would like to extract those rows too.
I had a look at the Hive sampling documentation and I see that something like this can be done to extract a 1% sample:
SELECT * FROM source
TABLESAMPLE (1 PERCENT) s;
but I am not sure how to add the constraint where I would like all other instances of those 1% userID
s selected too.
回答1:
You can use rand() to split the data randomly and with the proper percent of userid in your category. I recommend rand() because setting the seed to something make the results repeatable.
select c.*
from
(select userID
, if(rand(5555)<0.1, 'test','train') end as type
from
(select userID
from mytable
group by userID
) a
) b
right outer join
(select *
from userID
) c
on a.userid=c.userid
where type='test'
;
This is set up for entity level modeling purposes, which is why I have test and train as types.
来源:https://stackoverflow.com/questions/31190498/random-sample-table-with-hive-but-including-matching-rows