Biased random in SQL?


I have some entries in my database, in my case Videos with a rating and popularity and other factors. Of all these factors I calculate a likelihood factor or more to say a b

3 Answers

闹比i

2021-01-15 19:51
You need to generate a random number per row and weight it.

In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND: an unseeded RAND() is evaluated once per query, so every row would get the same value. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost.
```
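-- Sample data: three rows with boosts 1, 2 and 7
DECLARE @sample TABLE (id int, boost int);

INSERT @sample VALUES (1, 1), (2, 2), (3, 7);

SELECT
    RAND(CHECKSUM(NEWID())) * boost AS weighted,  -- per-row random value scaled by boost
    SUM(boost) OVER () AS boostcount,             -- total boost across all rows
    id
FROM
    @sample
GROUP BY
    id, boost
ORDER BY
    weighted DESC;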
```
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.

Finally, ORDER BY NEWID() on its own gives a random order that takes no account of boost. NEWID() is useful for seeding RAND, but not by itself.

This sample was put together on SQL Server 2008, BTW.
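
To sanity-check how this ordering behaves before relying on it, here is a minimal Python sketch (not part of the original answer) that simulates the same idea: one uniform random number per row, multiplied by boost, highest value first. The function names, the trial count and the fixed seed are assumptions made for illustration.

```python
import random
from collections import Counter


def pick_top(rows, rng):
    """Return the id whose random() * boost is largest (mirrors ORDER BY weighted DESC)."""
    return max(rows, key=lambda row: rng.random() * row[1])[0]


def estimate_first_place(rows, trials=20000, seed=0):
    """Estimate how often each id comes out first over many simulated queries."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    counts = Counter(pick_top(rows, rng) for _ in range(trials))
    return {row_id: counts[row_id] / trials for row_id, _ in rows}


# Same sample data as the query above: (id, boost)
sample = [(1, 1), (2, 2), (3, 7)]
shares = estimate_first_place(sample)
print(shares)
```

Comparing `shares` with boost / SUM(boost) (0.1, 0.2 and 0.7 for this sample) shows how closely the first-place frequency tracks the boost values.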
后悔当初

2021-01-15 19:52
My problem was similar: every person had a calculated number of tickets in the final draw. The more tickets you had, the higher your chance to win "the lottery".

I didn't trust any of the results I found on the web, such as rand() * multiplier or the one with -log(rand()), so I wanted to implement my own straightforward solution.

What I did, adapted to your case, would look a little bit like this:
```
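-- Each row appears `boost` times in the result (one row per "ticket")
SELECT `values`.id
FROM (SELECT id, boost FROM foo) AS `values`  -- VALUES is a reserved word in MySQL, so the alias is quoted
INNER JOIN (
    -- numbers 1..100, provided the ids in `user` cover every remainder
    SELECT id % 100 + 1 AS counter
    FROM user
    GROUP BY counter) AS numbers ON numbers.counter <= `values`.boost
ORDER BY RAND()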
```
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.

Before I used this query I checked two things:
1. The maximum boost is no greater than the maximum number returned by the numbers query.
2. The inner query returns ALL numbers between 1..100. It might not, depending on your table!

Since I have all distinct numbers between 1..100, joining on numbers.counter <= values.boost means that a row with a boost of 2 ends up in the final result twice, and a row with a boost of 100 ends up in it 100 times. In other words, if the sum of boosts is 4212, as it was in my case, you get 4212 rows in the final set.

Finally I let MySQL sort it randomly.

Edit: For the inner query to work properly, make sure to use a large table, or make sure that the ids don't skip any numbers. Better yet, and probably a bit faster, you could create a temporary table that simply holds all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost
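
The numbers-table variant can be tried out in isolation. This is a rough sketch using Python's built-in sqlite3 module rather than MySQL, so RANDOM() stands in for RAND(). The table names follow the answer (foo, numbers), while the function name and the sample rows are made up for illustration.

```python
import sqlite3


def build_ticket_pool(rows, max_boost):
    """Expand each (id, boost) row into `boost` tickets and return them shuffled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, boost INTEGER)")
    conn.executemany("INSERT INTO foo VALUES (?, ?)", rows)

    # Helper table holding every number 1..max_boost with no gaps
    conn.execute("CREATE TABLE numbers (id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO numbers VALUES (?)",
        [(n,) for n in range(1, max_boost + 1)],
    )

    # A row with boost b matches numbers 1..b, so it appears b times
    tickets = conn.execute(
        "SELECT foo.id FROM foo "
        "INNER JOIN numbers ON numbers.id <= foo.boost "
        "ORDER BY RANDOM()"
    ).fetchall()
    conn.close()
    return [row[0] for row in tickets]


pool = build_ticket_pool([(1, 1), (2, 2), (3, 7)], max_boost=7)
winner = pool[0]  # the first ticket after shuffling
print(len(pool), winner)
```

The pool size equals the sum of the boosts, which is why this approach is only comfortable when the boosts are small integers.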
青春惊慌失措

2021-01-15 19:55
I'd suggest a straightforward solution with two queries, using a cumulative boost calculation.

First, select the sum of boosts and generate a random number between 0 and that sum:
```
select ceil(rand() * sum(boost)) from table;
```
This value should be stored as a variable; let's call it {random_number}.

Then select the table rows while calculating the cumulative sum of boosts, and find the first row whose cumulative boost is greater than or equal to {random_number}:
```
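SET @cumulative_boost = 0;
SELECT
  id,
  cumulative_boost
FROM (
  -- running total of boost, accumulated row by row
  SELECT
    id,
    @cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
  FROM
    table
  ORDER BY id
) AS running
WHERE
  cumulative_boost >= {random_number}
ORDER BY cumulative_boost
LIMIT 1;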
```
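
The same two steps can be sketched outside the database to check the selection logic. This Python version is an illustration only: the function names and sample rows are assumptions, and it uses 1 - random() so that the drawn number always falls in 1..total.

```python
import math
import random


def pick_by_cumulative_boost(rows, random_number):
    """Return the id of the first row whose running boost total reaches random_number."""
    cumulative = 0
    for row_id, boost in sorted(rows):  # walk the rows in id order
        cumulative += boost
        if cumulative >= random_number:
            return row_id
    raise ValueError("random_number exceeds the total boost")


def draw(rows, rng=None):
    """Step 1: draw a number in 1..total. Step 2: look it up in the running totals."""
    rng = rng or random.Random()
    total = sum(boost for _, boost in rows)
    # 1 - rng.random() lies in (0, 1], so the ceiling is always within 1..total
    random_number = math.ceil((1.0 - rng.random()) * total)
    return pick_by_cumulative_boost(rows, random_number)


rows = [(1, 1), (2, 2), (3, 7)]  # (id, boost)
print(draw(rows))
```

With these rows, the numbers 1..10 map to id 1 once, id 2 twice and id 3 seven times, so each id is chosen in proportion to its boost.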