MySQL select 10 random rows from 600K rows fast

粉色の甜心 2020-11-21 05:06

How can I best write a query that selects 10 rows randomly from a total of 600k?

26 Answers
  • 2020-11-21 05:28

    You can use a random offset with a limit:

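    -- `table_name` is a placeholder; TABLE itself is a reserved word in MySQL.
    PREPARE stm FROM 'SELECT * FROM `table_name` LIMIT 10 OFFSET ?';
    SET @total = (SELECT COUNT(*) FROM `table_name`);
    SET @_offset = FLOOR(RAND() * @total);  -- random start position in [0, @total - 1]
    EXECUTE stm USING @_offset;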
    

    You can also apply a WHERE clause, like so:

    -- Count and select with the same filter so the offset stays in range.
    PREPARE stm FROM 'SELECT * FROM `table_name` WHERE available = true LIMIT 10 OFFSET ?';
    SET @total = (SELECT COUNT(*) FROM `table_name` WHERE available = true);
    SET @_offset = FLOOR(RAND() * @total);
    EXECUTE stm USING @_offset;
    

    Tested on a table of 600,000 rows (700 MB): query execution took ~0.016 s on an HDD.

    EDIT: The offset might land close to the end of the table, in which case the SELECT returns fewer rows (possibly only one). To avoid this, check the offset again after computing it, like so:

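    SET @rows_count = 10;
    -- `table_name` is a placeholder; TABLE itself is a reserved word in MySQL.
    PREPARE stm FROM 'SELECT * FROM `table_name` WHERE available = true LIMIT ? OFFSET ?';
    SET @total = (SELECT COUNT(*) FROM `table_name` WHERE available = true);
    SET @_offset = FLOOR(RAND() * @total);
    -- If fewer than @rows_count rows remain after the offset, step back by @rows_count.
    SET @_offset = (SELECT IF(@total - @_offset < @rows_count, @_offset - @rows_count, @_offset));
    -- Guard against a negative offset when the table holds fewer than @rows_count rows.
    SET @_offset = (SELECT IF(@_offset < 0, 0, @_offset));
    EXECUTE stm USING @rows_count, @_offset;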
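
    The offset adjustment may be easier to follow outside SQL. Here is a minimal Python sketch of the same arithmetic, assuming the offset is computed in application code before it is bound to the query. The names `clamp_offset` and `random_offset` are hypothetical helpers, not part of MySQL or of this answer.

```python
import random


def clamp_offset(total, rows_count, offset):
    """Mirror the two IF() adjustments from the SQL above."""
    # Fewer than rows_count rows remain after the offset: step back by rows_count.
    if total - offset < rows_count:
        offset -= rows_count
    # The table may hold fewer than rows_count rows overall; never go below 0.
    return max(offset, 0)


def random_offset(total, rows_count):
    """Pick a random start offset that still leaves rows_count rows to read."""
    if total <= 0:
        return 0
    # random.randrange(total) plays the role of FLOOR(RAND() * @total).
    return clamp_offset(total, rows_count, random.randrange(total))
```

    With 100 rows and a page of 10, an offset of 95 is moved back to 85, while an offset of 50 is left alone.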
    
  • I used the approach from http://jan.kneschke.de/projects/mysql/order-by-rand/, posted by Riedsio (I used the stored-procedure variant that returns one or more random values):

       -- Fragment of a stored procedure body; cnt counts how many random rows are still wanted.
       DROP TEMPORARY TABLE IF EXISTS rands;
       CREATE TEMPORARY TABLE rands ( rand_id INT );
    
        loop_me: LOOP
            IF cnt < 1 THEN
              LEAVE loop_me;
            END IF;
    
            INSERT INTO rands
               SELECT r1.id
                 FROM random AS r1 JOIN
                      (SELECT (RAND() *
                                    (SELECT MAX(id)
                                       FROM random)) AS id)
                       AS r2
                WHERE r1.id >= r2.id
                ORDER BY r1.id ASC
                LIMIT 1;
    
            SET cnt = cnt - 1;
          END LOOP loop_me;
    

    In the article he solves the problem of gaps in the ids, which make the results not so random, by maintaining a separate table (using triggers, etc.; see the article). I solve it by adding another column, populated with contiguous numbers starting from 1. (Edit: this column is added to the temporary table created by the subquery at runtime; it doesn't affect your permanent table.)

       DROP TEMPORARY TABLE IF EXISTS rands;
       CREATE TEMPORARY TABLE rands ( rand_id INT );
    
        loop_me: LOOP
            IF cnt < 1 THEN
              LEAVE loop_me;
            END IF;
    
            SET @no_gaps_id := 0;
    
            INSERT INTO rands
               SELECT r1.id
                 FROM (SELECT id, @no_gaps_id := @no_gaps_id + 1 AS no_gaps_id FROM random) AS r1 JOIN
                      (SELECT (RAND() *
                                    (SELECT COUNT(*)
                                       FROM random)) AS id)
                       AS r2
                WHERE r1.no_gaps_id >= r2.id
                ORDER BY r1.no_gaps_id ASC
                LIMIT 1;
    
            SET cnt = cnt - 1;
          END LOOP loop_me;
    

    In the article I can see he went to great lengths to optimize the code. I have no idea if, or how much, my changes impact performance, but it works very well for me.
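
    To see why gaps in the ids skew the first version, here is a small Python sketch that computes the exact chance of each id being returned by `WHERE id >= RAND() * MAX(id) ORDER BY id LIMIT 1`. It assumes positive integer ids; `pick_probabilities` is a hypothetical helper written for this illustration only.

```python
from fractions import Fraction


def pick_probabilities(ids):
    """Exact selection probability per id for the 'first id >= random value' query."""
    ids = sorted(ids)
    top = ids[-1]  # MAX(id)
    probs = {}
    previous = 0
    for current in ids:
        # Any random value in (previous, current] selects `current`, so its
        # probability is the width of that interval divided by MAX(id).
        probs[current] = Fraction(current - previous, top)
        previous = current
    return probs


gappy = pick_probabilities([1, 2, 3, 10])       # ids with a gap before 10
contiguous = pick_probabilities([1, 2, 3, 4])   # gap-free numbering
```

    With ids 1, 2, 3, 10 the row with id 10 is picked with probability 7/10 and each of the others with 1/10. With contiguous numbers 1 to 4 every row gets 1/4, which is what the extra `no_gaps_id` column achieves.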

  • 2020-11-21 05:29

    If you want one random record (regardless of whether there are gaps between ids):

    PREPARE stmt FROM 'SELECT * FROM `table_name` LIMIT 1 OFFSET ?';
    SET @count = (SELECT
            FLOOR(RAND() * COUNT(*))
        FROM `table_name`);
    
    EXECUTE stmt USING @count;
    

    Source: https://www.warpconduit.net/2011/03/23/selecting-a-random-record-using-mysql-benchmark-results/#comment-1266
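
    The same count-then-offset idea can be tried from application code. The sketch below is an illustration under assumptions: it uses Python's built-in `sqlite3` module as a stand-in for MySQL (both accept `LIMIT 1 OFFSET ?`), and `random_row` and the `posts` table are hypothetical names.

```python
import random
import sqlite3


def random_row(conn, table):
    """Return one random row, or None if the table is empty."""
    # The table name is interpolated, so it must come from trusted code, not user input.
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    if total == 0:
        return None
    offset = random.randrange(total)  # same role as FLOOR(RAND() * COUNT(*))
    query = f'SELECT * FROM "{table}" LIMIT 1 OFFSET ?'
    return conn.execute(query, (offset,)).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
# Ids with gaps: the offset counts rows, so the gaps do not bias the pick.
conn.executemany("INSERT INTO posts VALUES (?, ?)", [(1, "a"), (5, "b"), (100, "c")])
row = random_row(conn, "posts")
```

    Because the offset is a row position rather than an id value, each of the three rows is equally likely even though the ids are 1, 5 and 100.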

  • 2020-11-21 05:30

    All the best answers have already been posted (mainly those referencing http://jan.kneschke.de/projects/mysql/order-by-rand/).

    I want to point out another speed-up possibility: caching. Think about why you need random rows. Probably you want to display some random post or random ad on a website. If you are getting 100 req/s, does each visitor really need a different set of random rows? Usually it is completely fine to cache these X random rows for 1 second (or even 10 seconds). It doesn't matter if 100 unique visitors in the same second get the same random posts, because in the next second another 100 visitors will get a different set of posts.

    With this caching you can also use one of the slower solutions for getting the random data, as it will be fetched from MySQL only once per second regardless of your req/s.
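
    A minimal sketch of such a cache in Python, assuming a single-process application. `RandomRowsCache` and the `fetch` callable (whatever function runs the slow random-rows query) are hypothetical names, and the sketch is not thread-safe as written.

```python
import time


class RandomRowsCache:
    """Serve the same batch of random rows until it is ttl_seconds old."""

    def __init__(self, fetch, ttl_seconds=1.0, clock=time.monotonic):
        self._fetch = fetch        # callable that runs the (possibly slow) query
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the expiry logic can be tested
        self._rows = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        # Refetch only when nothing is cached yet or the cached batch has expired.
        if self._rows is None or now >= self._expires_at:
            self._rows = self._fetch()
            self._expires_at = now + self._ttl
        return self._rows
```

    Every request calls `get()`, but the database sees at most one query per `ttl_seconds`, however many requests arrive in that window.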

  • 2020-11-21 05:31

    I think this is a simple and yet faster way. I tested it on a live server against a few of the answers above, and it was faster.

     SELECT * FROM `table_name` WHERE id >= (SELECT FLOOR( MAX(id) * RAND()) FROM `table_name` ) ORDER BY id LIMIT 30; 
    

    //Took 0.0014secs against a table of 130 rows

    SELECT * FROM `table_name` WHERE 1 ORDER BY RAND() LIMIT 30
    

    //Took 0.0042secs against a table of 130 rows

     SELECT name
    FROM random AS r1 JOIN
       (SELECT CEIL(RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
    WHERE r1.id >= r2.id
    ORDER BY r1.id ASC
    LIMIT 30
    

    //Took 0.0040secs against a table of 130 rows

  • 2020-11-21 05:32

    One way that I find pretty good, if there's an autogenerated id, is to use the modulo operator '%'. For example, if you need 10,000 random records out of 70,000, you can say you need 1 out of every 7 rows. That can be expressed with this query:

    -- `table_name` is a placeholder; TABLE itself is a reserved word in MySQL.
    SELECT * FROM
        `table_name`
    WHERE
        id %
        FLOOR(
            (SELECT COUNT(1) FROM `table_name`)
            / 10000
        ) = 0;
    

    If dividing the total number of rows by the target count does not give an integer, you will get more rows than you asked for, so you should add a LIMIT clause to trim the result set, like this:

    -- `table_name` is a placeholder; TABLE itself is a reserved word in MySQL.
    SELECT * FROM
        `table_name`
    WHERE
        id %
        FLOOR(
            (SELECT COUNT(1) FROM `table_name`)
            / 10000
        ) = 0
    LIMIT 10000;
    

    This does require a full scan, but it is faster than ORDER BY RAND(), and in my opinion simpler to understand than the other options mentioned in this thread. Also, if the system that writes to the DB creates rows in batches, the result might not be as random as you were expecting.
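
    The arithmetic behind the two queries can be checked with a short Python sketch. It assumes contiguous ids starting at 1; `modulo_sample` is a hypothetical helper that mirrors the WHERE and LIMIT clauses, not code from this answer.

```python
def modulo_sample(ids, target):
    """Keep every id divisible by FLOOR(total / target), then cap at target rows."""
    if target <= 0:
        raise ValueError("target must be positive")
    ids = list(ids)
    step = len(ids) // target  # FLOOR(COUNT(*) / target)
    if step == 0:
        # In MySQL, id % 0 evaluates to NULL, so the query would match no rows.
        raise ValueError("table has fewer rows than the requested sample size")
    matches = [i for i in ids if i % step == 0]  # WHERE id % step = 0
    return matches[:target]                      # LIMIT target


exact = modulo_sample(range(1, 70001), 10000)   # 70,000 / 10,000 = 7 exactly
capped = modulo_sample(range(1, 75001), 10000)  # 75,000 / 10,000 = 7.5, floored to 7
```

    With 70,000 rows the step is 7 and exactly 10,000 ids match. With 75,000 rows the step is still 7, so 10,714 ids match and the LIMIT trims them to 10,000.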
