randomizing large dataset

失恋的感觉 2021-01-15 14:58

I am trying to find a way to get a random selection from a large dataset.

We expect the set to grow to ~500K records, so it is important to find an approach that keeps performance acceptable at that scale.

3 Answers
  •  终归单人心
    2021-01-15 15:09

    You could solve this with some denormalization:

    • Build a secondary table that contains the same pkeys and statuses as your data table
    • Add and populate a status group column (StatusPkey below), a kind of sub-pkey that you number yourself: a 1-based counter that restarts for each status
    Pkey    Status    StatusPkey
    1       A         1
    2       A         2
    3       B         1
    4       B         2
    5       C         1
    ...     C         ...
    n       C         m (where m = # of C statuses)
    

    When you don't need to filter, you can generate random numbers against the Pkey, as you mentioned above. When you do need to filter, generate random numbers against the StatusPkey values of the particular status you're interested in.
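
    A minimal sketch of that idea, assuming SQLite and hypothetical tables named data and status_index (the names, columns, and toy data are mine, not from the question; the window function that numbers each status needs SQLite 3.25+, and the unfiltered path assumes the Pkeys have no gaps):

        import random
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE data (pkey INTEGER PRIMARY KEY, status TEXT, payload TEXT);
            CREATE TABLE status_index (pkey INTEGER PRIMARY KEY, status TEXT, status_pkey INTEGER);
            CREATE INDEX idx_status ON status_index (status, status_pkey);
        """)

        # Toy data: contiguous pkeys 1..1000, each with a random status.
        conn.executemany(
            "INSERT INTO data VALUES (?, ?, ?)",
            [(i, random.choice("ABC"), f"row {i}") for i in range(1, 1001)],
        )

        # Populate the secondary table: status_pkey restarts at 1 for each status.
        conn.execute("""
            INSERT INTO status_index (pkey, status, status_pkey)
            SELECT pkey, status,
                   ROW_NUMBER() OVER (PARTITION BY status ORDER BY pkey)
            FROM data
        """)

        def random_row(status=None):
            """Return one random row, optionally restricted to a status."""
            if status is None:
                # Unfiltered: draw a random Pkey directly (assumes no gaps).
                (max_pkey,) = conn.execute("SELECT MAX(pkey) FROM data").fetchone()
                target = random.randint(1, max_pkey)
                return conn.execute(
                    "SELECT * FROM data WHERE pkey = ?", (target,)
                ).fetchone()
            # Filtered: draw against the per-status counter instead.
            (max_spk,) = conn.execute(
                "SELECT MAX(status_pkey) FROM status_index WHERE status = ?", (status,)
            ).fetchone()
            target = random.randint(1, max_spk)
            return conn.execute(
                """SELECT d.* FROM status_index s
                   JOIN data d ON d.pkey = s.pkey
                   WHERE s.status = ? AND s.status_pkey = ?""",
                (status, target),
            ).fetchone()

        print(random_row())     # any status
        print(random_row("B"))  # only rows with status B

    Both lookups are single-row probes against indexed columns, so they stay cheap as the table grows toward 500K rows.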

    There are several ways to build this table. You could have a procedure that runs on an interval, or you could maintain it live. The latter would be a performance hit, though, since recalculating the StatusPkey values on every change could get expensive.
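
    For the scheduled variant, the whole rebuild can be one transaction against the same hypothetical tables as above, so readers never see a half-numbered index:

        def rebuild_status_index(conn):
            """Recompute every StatusPkey from scratch; run this on a schedule (e.g. cron)."""
            with conn:  # the connection context manager wraps this in a single transaction
                conn.execute("DELETE FROM status_index")
                conn.execute("""
                    INSERT INTO status_index (pkey, status, status_pkey)
                    SELECT pkey, status,
                           ROW_NUMBER() OVER (PARTITION BY status ORDER BY pkey)
                    FROM data
                """)

    Maintaining it live would instead mean recomputing or shifting counters on every insert and delete, which is where the performance hit mentioned above comes from.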
