How to select one row randomly taking into account a weight?

后端未结

关注

 7  1948

I have a table which looks like that:

id: primary key
content: varchar
weight: int

What I want to do is randomly select one row from this t

相关标签:

7条回答

佛祖请我去吃肉

2020-11-29 12:39
A simple approach (avoiding joins or subqueries) is to just multiply the weight by a random number between 0 and 1 to produce a temporary weight to sort by:
```
SELECT t.*, RAND() * t.weight AS w 
FROM table t 
ORDER BY w DESC
LIMIT 1
```
To understand this, consider that RAND() * 2x will be a larger value than RAND() * x approximately two thirds of the time. Consequently, over time each row should be selected with a frequency that's proportional to its relative weight (eg. a row with weight 100 will be selected about 100 times more often than a row with weight 1, etc).

Update: this method doesn't in fact produce the correct distributions, so for now don't use it! (see the comments below). I think there should still be a simple method similar to the above that will work, but for now the more complex method below, involving joins, might be better. I'm leaving this answer up because: (a) there's relevant discussion in the comments below, and (b) if/when I get a chance, I'll try to fix it.
0 讨论(0)
发布评论:

提交评论
- 加载中...

清酒与你

2020-11-29 12:39

Maybe this one:

SELECT * FROM <Table> T JOIN (SELECT FLOOR(MAX(ID)*RAND()) AS ID FROM <Table> ) AS x ON T.ID >= x.ID LIMIT 1;

Or this one:

SELECT * FROM tablename
          WHERE somefield='something'
          ORDER BY RAND() LIMIT 1

0 讨论(0)

挽巷

2020-11-29 12:44
I don't remember how to RND() in mysql, but here working example for MSSQL:
```
SELECT TOP(1) (weight +RAND ()) r, id, content, weight FROM Table
ORDER BY 1 DESC
```
If TOP(1) is not applicable you just fetch first record from total result set.
0 讨论(0)
发布评论:

提交评论
- 加载中...
刺人心

2020-11-29 12:46
I have tried van's solution and, although it works, it is not quick.

My Solution

The way that I am solving this problem is by maintaining a separate, linked table for the weighting. The basic table structure is similar to this:
```
CREATE TABLE `table1` (
  `id` int(11) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  `name` varchar(100),
  `weight` tinyint(4) NOT NULL DEFAULT '1',
);

CREATE TABLE `table1_weight` (
  `id` bigint(20) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  `table1_id` int(11) NOT NULL
);
```
If I have a record in table1 with a weight of 3, then I create 3 records in table1_weight, linked to table1 via the table1_id field. Whatever the value of weight is in table1, that's how many linked records I create in table1_weight.

Testing

On a dataset with 976 records in table1 with a total weight of 2031 and hence 2031 records in table1_weight, I ran the following two SQLs:
1. A version of van's solution
```
SELECT t.*
FROM table1 t
INNER JOIN
  ( SELECT t.id,
       SUM(tt.weight) AS cum_weight
   FROM table1 t
   INNER JOIN table1 tt ON tt.id <= t.id
   GROUP BY t.id) tc ON tc.id = t.id,
  ( SELECT SUM(weight) AS total_weight
   FROM table1) tt,
  ( SELECT RAND() AS rnd) r
WHERE r.rnd * tt.total_weight <= tc.cum_weight
ORDER BY t.id ASC
LIMIT 1
```
2. Joining to a secondary table for the weighting
```
SELECT t.*
FROM table1 t
INNER JOIN table1_weight w
    ON w.table1_id = t.id
ORDER BY RAND()
LIMIT 1
```
SQL 1 takes consistently 0.4 seconds.

SQL 2 takes between 0.01 and 0.02 seconds.

Conclusion

If the speed of selection of a random, weighted record is not an issue, then the single table SQL suggested by van is fine and doesn't have the overhead of maintaining a separate table.

If, as in my case, a short selection time is critical, then I would recommend the two table method.
0 讨论(0)
发布评论:

提交评论
- 加载中...
被撕碎了的回忆

2020-11-29 12:51
This one seems to work, but I'm not sure of the math behind it.
```
SELECT RAND() / t.weight AS w, t.* 
FROM table t 
WHERE t.weight > 0
ORDER BY 1
LIMIT 1
```
My guess at the reason it works is that the ascending order looks for the smallest results and by dividing by the weight for higher weights the random result is clustered more tightly near zero.

I tested it (actually the same algorithm in postgresql) with 209000 queries over 3000 rows and the weight representation came out correct.

my input data:
```
select count(*),weight from t group by weight
 count | weight 
-------+--------
  1000 |     99
  1000 |     10
  1000 |    100
(3 rows)
```
my results:
```
jasen=# with g as ( select generate_series(1,209000) as i )
,r as (select (  select t.weight as w 
    FROM  t 
    WHERE t.weight > 0
    ORDER BY ( random() / t.weight ) + (g.i*0)  LIMIT 1 ) from g)

select r.w, count(*), r.w*1000 as expect from r group by r.w;

  w  | count | expect 
-----+-------+--------
  99 | 98978 |  99000
  10 | 10070 |  10000
 100 | 99952 | 100000
(3 rows)
```
The +(g.i*0) has no effect on the arithmetic result but an external reference is required to to force the planner to re-evaluate the sub-select for each of the 209K input rows produced in in g
0 讨论(0)
发布评论:

提交评论
- 加载中...

臣服心动

2020-11-29 12:57

This works in MSSQL and I am sure that it should be possible to change couple of keywords to make it work in MySQL as well (maybe even nicer):

SELECT      TOP 1 t.*
FROM        @Table t
INNER JOIN (SELECT      t.id, sum(tt.weight) AS cum_weight
            FROM        @Table t
            INNER JOIN  @Table tt ON  tt.id <= t.id
            GROUP BY    t.id) tc
        ON  tc.id = t.id,
           (SELECT  SUM(weight) AS total_weight FROM @Table) tt,
           (SELECT  RAND() AS rnd) r
WHERE       r.rnd * tt.total_weight <= tc.cum_weight
ORDER BY    t.id ASC

The idea is to have a cumulative weight for each row (subselect-1), then find the position of the spanned RAND() in this cumulative range.

0 讨论(0)

1 2 下一页

How to select one row randomly taking into account a weight?

My Solution

Testing

Conclusion