Random sample table with Hive, but including matching rows

问题

I have a large table containing a userID column and other user variable columns, and I would like to use Hive to extract a random sample of users based on their userID. Furthermore, sometimes these users will be on multiple rows and if a randomly selected userID is contained in other parts of the table I would like to extract those rows too.

I had a look at the Hive sampling documentation and I see that something like this can be done to extract a 1% sample:

SELECT * FROM source 
TABLESAMPLE (1 PERCENT) s;

but I am not sure how to add the constraint where I would like all other instances of those 1% userIDs selected too.

回答1:

You can use rand() to split the data randomly and with the proper percent of userid in your category. I recommend rand() because setting the seed to something make the results repeatable.

select c.*
from 
(select userID
, if(rand(5555)<0.1, 'test','train') end as type
    from
    (select userID 
    from mytable 
    group by userID
    ) a
) b
right outer join
(select *
from userID
) c
on a.userid=c.userid
where type='test'
;

This is set up for entity level modeling purposes, which is why I have test and train as types.

来源：https://stackoverflow.com/questions/31190498/random-sample-table-with-hive-but-including-matching-rows

标签

Hive

hiveql

random-sample

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!