MySQL select 10 random rows from 600K rows fast

How can I best write a query that selects 10 rows randomly from a total of 600k?

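You can use a random offset with a LIMIT. Note that this returns 10 consecutive rows starting at a random position, not 10 independently chosen rows: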

-- `table` is a reserved word in MySQL, so the placeholder name is quoted
PREPARE stm FROM 'SELECT * FROM `table` LIMIT 10 OFFSET ?';
SET @total = (SELECT COUNT(*) FROM `table`);
SET @_offset = FLOOR(RAND() * @total);
EXECUTE stm USING @_offset;

You can also apply a WHERE clause, like so:

PREPARE stm FROM 'SELECT * FROM `table` WHERE available=true LIMIT 10 OFFSET ?';
SET @total = (SELECT COUNT(*) FROM `table` WHERE available=true);
SET @_offset = FLOOR(RAND() * @total);
EXECUTE stm USING @_offset;

Tested on a 600,000-row (700 MB) table: the query took about 0.016 s on an HDD.

EDIT: The offset might land close to the end of the table, in which case the SELECT returns fewer rows (possibly only 1). To avoid this, adjust the offset after computing it, like so:

SET @rows_count = 10;
PREPARE stm FROM "SELECT * FROM `table` WHERE available=true LIMIT ? OFFSET ?";
SET @total = (SELECT COUNT(*) FROM `table` WHERE available=true);
SET @_offset = FLOOR(RAND() * @total);
-- If fewer than @rows_count rows remain after the offset, step back by @rows_count
SET @_offset = (SELECT IF(@total-@_offset<@rows_count,@_offset-@rows_count,@_offset));
-- Never let the offset go negative
SET @_offset = (SELECT IF(@_offset<0,0,@_offset));
EXECUTE stm USING @rows_count,@_offset;
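
The offset adjustment in the EDIT can be checked outside the database. Here is a minimal Python sketch of the same arithmetic; the function name `clamped_offset` and its parameters are hypothetical and only mirror the three `SET @_offset` statements above.

```python
import random


def clamped_offset(total, rows_count, rand=random.random):
    """Mirror the three SET @_offset statements from the SQL above."""
    # FLOOR(RAND() * @total): a start position in [0, total - 1].
    offset = int(rand() * total)
    # If fewer than rows_count rows remain after offset, step back by rows_count.
    if total - offset < rows_count:
        offset -= rows_count
    # Never let the offset go negative (e.g. tables smaller than rows_count).
    return max(offset, 0)


print(clamped_offset(600_000, 10))
```

As long as the table holds at least `rows_count` rows, the window `[offset, offset + rows_count)` always fits inside the table, so the query returns the full number of rows.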

不要未来只要你来

2020-11-21 05:29

I used the approach from http://jan.kneschke.de/projects/mysql/order-by-rand/, posted by Riedsio (specifically, the stored-procedure variant that returns one or more random values):

   DROP TEMPORARY TABLE IF EXISTS rands;
   CREATE TEMPORARY TABLE rands ( rand_id INT );

    loop_me: LOOP
        IF cnt < 1 THEN
          LEAVE loop_me;
        END IF;

        INSERT INTO rands
           SELECT r1.id
             FROM random AS r1 JOIN
                  (SELECT (RAND() *
                                (SELECT MAX(id)
                                   FROM random)) AS id)
                   AS r2
            WHERE r1.id >= r2.id
            ORDER BY r1.id ASC
            LIMIT 1;

        SET cnt = cnt - 1;
      END LOOP loop_me;

In the article, he solves the problem of gaps in the ids (which make the results less random) by maintaining a mapping table with triggers; see the article for details. I solve it by adding another column populated with contiguous numbers starting from 1. (Edit: this column is added to the temporary table that the subquery creates at runtime; it doesn't affect your permanent table.)

   DROP TEMPORARY TABLE IF EXISTS rands;
   CREATE TEMPORARY TABLE rands ( rand_id INT );

    loop_me: LOOP
        IF cnt < 1 THEN
          LEAVE loop_me;
        END IF;

        SET @no_gaps_id := 0;

        INSERT INTO rands
           SELECT r1.id
             FROM (SELECT id, @no_gaps_id := @no_gaps_id + 1 AS no_gaps_id FROM random) AS r1 JOIN
                  (SELECT (RAND() *
                                (SELECT COUNT(*)
                                   FROM random)) AS id)
                   AS r2
            WHERE r1.no_gaps_id >= r2.id
            ORDER BY r1.no_gaps_id ASC
            LIMIT 1;

        SET cnt = cnt - 1;
      END LOOP loop_me;

In the article I can see he went to great lengths to optimize the code. I have no idea whether, or how much, my changes affect performance, but it works very well for me.
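
The bias caused by gaps can be made concrete with a small calculation. This is a sketch in Python, assuming the threshold `RAND() * MAX(id)` is uniform and the query takes the first id at or above it, as the first loop does; the function name `pick_probabilities` is hypothetical.

```python
def pick_probabilities(ids):
    """Chance that each id is returned by
    `WHERE id >= RAND() * MAX(id) ORDER BY id LIMIT 1`."""
    ids = sorted(ids)
    max_id = ids[-1]
    probs = {}
    previous = 0
    for current in ids:
        # Any threshold between the previous id and this one selects this id,
        # so an id that follows a large gap is picked more often.
        probs[current] = (current - previous) / max_id
        previous = current
    return probs


print(pick_probabilities([1, 2, 3, 10]))
# {1: 0.1, 2: 0.1, 3: 0.1, 10: 0.7}
```

With contiguous numbering (the `no_gaps_id` column above), every interval has the same width, so every row is equally likely.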

终归单人心

2020-11-21 05:29
If you want one random record (regardless of whether there are gaps between ids):
```
PREPARE stmt FROM 'SELECT * FROM `table_name` LIMIT 1 OFFSET ?';
SET @count = (SELECT
        FLOOR(RAND() * COUNT(*))
    FROM `table_name`);

EXECUTE stmt USING @count;
```
Source: https://www.warpconduit.net/2011/03/23/selecting-a-random-record-using-mysql-benchmark-results/#comment-1266
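
The same count-then-offset idea can be driven from application code. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server; the `items` table and the `random_row` function are hypothetical, and a MySQL driver would use its own placeholder style.

```python
import random
import sqlite3


def random_row(conn, rng=random):
    """Return one row chosen via COUNT(*) plus a random OFFSET."""
    (count,) = conn.execute("SELECT COUNT(*) FROM items").fetchone()
    if count == 0:
        return None
    offset = rng.randrange(count)  # integer in [0, count - 1]
    return conn.execute(
        "SELECT id, name FROM items LIMIT 1 OFFSET ?", (offset,)
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
# Ids with gaps on purpose: the offset approach is not affected by them.
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, "a"), (5, "b"), (42, "c")])
print(random_row(conn))
```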
不知归路

2020-11-21 05:30

All the best answers have already been posted (mainly those referencing http://jan.kneschke.de/projects/mysql/order-by-rand/).

I want to point out another speed-up possibility: caching. Think about why you need random rows. You probably want to display a random post or a random ad on a website. If you are getting 100 req/s, does each visitor really need a different set of random rows? Usually it is completely fine to cache these X random rows for 1 second (or even 10 seconds). It doesn't matter if 100 unique visitors in the same second get the same random posts, because the next second another 100 visitors will get a different set of posts.

With this kind of caching you can also use one of the slower solutions for getting the random data, as it will be fetched from MySQL only once per second regardless of your req/s.
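
A minimal sketch of such a time-based cache in Python follows. The class name `TimedCache`, its parameters, and the `loader` callback are all hypothetical; `loader` stands for whatever function runs your random-rows query.

```python
import time


class TimedCache:
    """Keep one computed value and refresh it at most once per `ttl` seconds."""

    def __init__(self, loader, ttl=1.0, clock=time.monotonic):
        self._loader = loader      # function that runs the (possibly slow) query
        self._ttl = ttl
        self._clock = clock
        self._value = None
        self._expires_at = None    # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: hit the database once and remember the result.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

Usage would look like `cache = TimedCache(fetch_random_rows, ttl=1.0)` followed by `cache.get()` on every request, where `fetch_random_rows` is your own query function. This sketch is not thread-safe; a multi-threaded server would need a lock around the refresh.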


心在旅途

2020-11-21 05:31

I think this is a simpler and faster way. I tested it on a live server against a few of the answers above, and it was faster.

SELECT * FROM `table_name` WHERE id >= (SELECT FLOOR( MAX(id) * RAND()) FROM `table_name` ) ORDER BY id LIMIT 30;

-- Took 0.0014 s against a table of 130 rows

SELECT * FROM `table_name` WHERE 1 ORDER BY RAND() LIMIT 30;

-- Took 0.0042 s against a table of 130 rows

SELECT name
FROM random AS r1 JOIN
   (SELECT CEIL(RAND() *
                 (SELECT MAX(id)
                    FROM random)) AS id)
    AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 30;

-- Took 0.0040 s against a table of 130 rows

别那么骄傲

2020-11-21 05:32
One way that I find pretty good, if there's an auto-generated id, is to use the modulo operator '%'. For example, if you need 10,000 random records out of 70,000, you could simplify this by saying you need 1 out of every 7 rows. This can be expressed in this query:
```
SELECT * FROM 
    `table` 
WHERE 
    id % 
    FLOOR(
        (SELECT count(1) FROM `table`) 
        / 10000
    ) = 0;
```
If the total row count is not an exact multiple of the number of rows you want, you will get more rows than you asked for, so you should add a LIMIT clause to trim the result set, like this:
```
SELECT * FROM 
    `table` 
WHERE 
    id % 
    FLOOR(
        (SELECT count(1) FROM `table`) 
        / 10000
    ) = 0
LIMIT 10000;
```
This does require a full scan, but it is faster than ORDER BY RAND() and, in my opinion, simpler to understand than the other options mentioned in this thread. Also, if the system that writes to the DB creates rows in batches, the result might not be as random as you were expecting.
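
To see how many rows the modulo filter keeps, the arithmetic can be replayed in Python. This is only a sketch that assumes gap-free ids starting at 1; the function name `modulo_sample` is hypothetical.

```python
def modulo_sample(ids, target):
    """Return the ids kept by `id % FLOOR(COUNT(*) / target) = 0`."""
    ids = list(ids)
    step = len(ids) // target  # FLOOR(total / target)
    if step == 0:
        # Fewer rows than requested: a step of 0 makes the modulo meaningless,
        # so this sketch simply keeps every id.
        return ids
    return [i for i in ids if i % step == 0]


print(len(modulo_sample(range(1, 70001), 10000)))  # 10000
print(len(modulo_sample(range(1, 75001), 10000)))  # 10714
```

The second call shows why the LIMIT is needed: 75,000 rows with a step of 7 yield 10,714 matches. Note also that the filter is deterministic, so the same ids come back on every run until the table changes.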