Biased random in SQL?

一向 2021-01-15 19:35

I have some entries in my database, in my case Videos with a rating and popularity and other factors. Of all these factors I calculate a likelihood factor or more to say a b

3 answers
  • 2021-01-15 19:51

    You need to generate a random number per row and weight it.

    In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND, which would otherwise return the same value for every row. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost.

    DECLARE @sample TABLE (id int, boost int)
    
    INSERT @sample VALUES (1, 1), (2, 2), (3, 7)
    
    SELECT
        RAND(CHECKSUM(NEWID())) * boost  AS weighted,
        SUM(boost) OVER () AS boostcount,
        id
    FROM
        @sample
    GROUP BY
        id, boost
    ORDER BY
        weighted DESC
    

    If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
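As a rough way to see what this ordering does, here is a minimal Python simulation, not SQL Server code. It mimics ORDER BY rand * boost DESC on the three sample rows and estimates how often each id ranks first, once with the raw boost and once with a log-smoothed weight. The helper names (`top_row`, `first_place_share`) and the choice of log(1 + boost) instead of log(boost) are assumptions made for this sketch; with plain log, a boost of 1 would get weight 0.

```python
import math
import random
from collections import Counter

# Sample rows from the answer's @sample table: (id, boost).
ROWS = [(1, 1), (2, 2), (3, 7)]

def top_row(rows, weight, rng):
    """Return the id with the largest rand * weight(boost), like ORDER BY weighted DESC."""
    return max(rows, key=lambda row: rng.random() * weight(row[1]))[0]

def first_place_share(rows, weight, trials=100_000, seed=0):
    """Estimate how often each id ranks first over many simulated queries."""
    rng = random.Random(seed)
    wins = Counter(top_row(rows, weight, rng) for _ in range(trials))
    return {row_id: wins[row_id] / trials for row_id, _ in rows}

raw = first_place_share(ROWS, lambda boost: boost)
# Assumption: log(1 + boost) rather than log(boost), so boost = 1 keeps a nonzero weight.
smoothed = first_place_share(ROWS, lambda boost: math.log(1 + boost))

print("raw boost:     ", raw)
print("log(1 + boost):", smoothed)
```

In this simulation the highest-boost row ranks first more often than its 7/10 share of the total boost, so the ordering favours high boosts without being strictly proportional to them. The log-smoothed weights pull the shares closer together.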

    Finally, ORDER BY NEWID() on its own gives a random order that takes no account of boost. NEWID() is useful for seeding RAND, but not by itself.

    This sample was put together on SQL Server 2008, BTW

  • 2021-01-15 19:52

    My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets, you had a higher chance of winning "the lottery".

    Since I didn't trust any of the solutions I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.

    What I did would, in your case, look a little bit like this:

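    -- VALUES is a reserved word in MySQL, so the alias is quoted with backticks
    SELECT `values`.id
    FROM (SELECT id, boost FROM foo) AS `values`
    INNER JOIN (
        -- numbers 1..100, derived from the ids of an existing table
        SELECT id % 100 + 1 AS counter
        FROM user
        GROUP BY counter) AS numbers ON numbers.counter <= `values`.boost
    ORDER BY RAND()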
    

    Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.

    Before I used this query I checked two things:

    1. The maximum boost is no greater than the maximum number returned by the numbers query
    2. That the inner query returns ALL numbers between 1..100. It might not depending on your table!

    Since I have all distinct numbers between 1..100, joining on numbers.counter <= values.boost means that a row with a boost of 2 ends up duplicated in the final result, and a row with a boost of 100 ends up in the final set 100 times. In other words, if the sum of boosts is 4212, which it was in my case, you get 4212 rows in the final set.

    Finally I let MySQL sort it randomly.

    Edit: For the inner query to work properly, make sure to use a large table, or make sure that the ids don't skip any numbers. Better yet, and probably a bit faster, you could create a temporary table that simply holds all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost
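The numbers-table variant from the edit can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so RANDOM() replaces RAND(), and the `foo` rows and the `numbers` table contents are made-up sample data for illustration only.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, boost INTEGER NOT NULL);
    INSERT INTO foo (id, boost) VALUES (1, 1), (2, 2), (3, 7);
    CREATE TABLE numbers (id INTEGER PRIMARY KEY);
""")
# Fill numbers with 1..n, where n covers the largest boost (check 1 in the answer).
max_boost = conn.execute("SELECT MAX(boost) FROM foo").fetchone()[0]
conn.executemany("INSERT INTO numbers (id) VALUES (?)",
                 [(n,) for n in range(1, max_boost + 1)])

# Each row appears once per "ticket"; RANDOM() is SQLite's counterpart of MySQL's RAND().
tickets = conn.execute("""
    SELECT foo.id
    FROM foo
    INNER JOIN numbers ON numbers.id <= foo.boost
    ORDER BY RANDOM()
""").fetchall()

ticket_counts = Counter(row_id for (row_id,) in tickets)
winner = tickets[0][0]  # first row of the shuffled set is the draw's winner
print(ticket_counts, winner)
```

The joined set holds one row per ticket, 10 rows for boosts 1, 2 and 7, so taking the first row of the shuffled result picks an id with probability boost / sum of boosts. The cost is that the intermediate set grows with the sum of all boosts.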

  • 2021-01-15 19:55

    I dare to suggest a straightforward solution with two queries, using a cumulative boost calculation.

    First, select the sum of boosts, and generate a random number between 0 and that sum:

    select ceil(rand() * sum(boost)) from table;
    

    This value should be stored as a variable, let's call it {random_number}.

    Then, select the table rows, calculating the cumulative sum of boosts, and find the first row whose cumulative boost is greater than or equal to {random_number}:

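    SET @cumulative_boost = 0;
    SELECT id, cumulative_boost
    FROM (
      -- running total of boost in id order; the alias cannot be used
      -- in the WHERE clause of the same SELECT, hence the derived table
      SELECT
        id,
        @cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
      FROM
        table
      ORDER BY id
    ) AS running
    WHERE
      cumulative_boost >= {random_number}
    ORDER BY cumulative_boost
    LIMIT 1;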
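
The same cumulative lookup can be sketched outside the database. This is a minimal Python illustration, not MySQL code: `pick_by_threshold` and `weighted_pick` are hypothetical helper names, the rows are made-up sample data, and `randint(1, total)` stands in for ceil(rand() * sum(boost)). It assumes positive integer boosts.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def pick_by_threshold(rows, threshold):
    """Return the id of the first row whose cumulative boost is >= threshold.

    rows is a list of (id, boost) pairs, already ordered by id.
    """
    cumulative = list(accumulate(boost for _, boost in rows))
    return rows[bisect_left(cumulative, threshold)][0]

def weighted_pick(rows, rng=random):
    """Draw a threshold in 1..sum(boost), then look up the matching row."""
    total = sum(boost for _, boost in rows)
    return pick_by_threshold(rows, rng.randint(1, total))

rows = [(1, 1), (2, 2), (3, 7)]  # hypothetical (id, boost) sample
print([pick_by_threshold(rows, t) for t in range(1, 11)])
print(weighted_pick(rows))
```

For this sample the first print gives [1, 2, 2, 3, 3, 3, 3, 3, 3, 3]: each id covers as many thresholds as its boost, which is why a uniformly drawn threshold selects rows in proportion to boost.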
    