Best way to select random rows PostgreSQL

I want a random selection of rows in PostgreSQL. I tried this:

select * from table where random() < 0.01;

But some others recommend this:

12 Answers

南旧

2020-11-22 07:22
A variation of the materialized view "Possible alternative" outlined by Erwin Brandstetter is possible.

Say, for example, that you don't want duplicates in the randomized values that are returned. So you will need to set a boolean value on the primary table containing your (non-randomized) set of values.

Assuming this is the input table:
```
Table: id_values

 id | used
----+-------
  1 | FALSE
  2 | FALSE
  3 | FALSE
  4 | FALSE
  5 | FALSE
 ...
```
Populate the ID_VALUES table as needed. Then, as described by Erwin, create a materialized view that randomizes the ID_VALUES table once:
```
CREATE MATERIALIZED VIEW id_values_randomized AS
  SELECT id
  FROM id_values
  ORDER BY random();
```
Note that the materialized view does not contain the used column, because this will quickly become out-of-date. Nor does the view need to contain other columns that may be in the id_values table.

In order to obtain (and "consume") random values, use an UPDATE-RETURNING on id_values, selecting id_values from id_values_randomized with a join, and applying the desired criteria to obtain only relevant possibilities. For example:
```
UPDATE id_values
SET used = TRUE
WHERE id_values.id IN 
  (SELECT i.id
    FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
    WHERE (NOT i.used)
    LIMIT 5)
RETURNING id;
```
Change LIMIT as necessary -- if you only need one random value at a time, change LIMIT to 1.
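
One caveat: the subquery above has no ORDER BY, so PostgreSQL is free to return the joined rows in any order and the stored order of the view is not guaranteed to be used. A sketch of one way to make the order explicit, assuming a hypothetical view name id_values_shuffled and a hypothetical pos column (neither is part of the original answer):

```sql
-- Hypothetical variant: store an explicit random position for each id.
CREATE MATERIALIZED VIEW id_values_shuffled AS
  SELECT id, row_number() OVER (ORDER BY random()) AS pos
  FROM id_values;

-- Index the position so reading in shuffled order is cheap (illustrative name).
CREATE UNIQUE INDEX id_values_shuffled_pos_idx ON id_values_shuffled (pos);

-- Consume the next 5 unused ids in the stored random order.
UPDATE id_values
SET used = TRUE
WHERE id IN
  (SELECT i.id
   FROM id_values_shuffled r
   JOIN id_values i ON i.id = r.id
   WHERE NOT i.used
   ORDER BY r.pos
   LIMIT 5)
RETURNING id;
```

This sketch does not address two sessions consuming ids at the same time; that would need additional locking.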

With the proper indexes on id_values, I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new ids are added to the id_values table.
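
A minimal sketch of the maintenance statements this implies, assuming id is the primary key of id_values. The partial index and its name are hypothetical, one possible reading of "proper indexes". A plain refresh re-runs the defining query and replaces the view's contents, and it blocks reads on the view while it runs, which is one reason to schedule it off-peak.

```sql
-- Hypothetical partial index: keeps only the still-unused ids, so the
-- "NOT used" filter stays cheap as more rows are consumed.
CREATE INDEX id_values_unused_idx ON id_values (id) WHERE NOT used;

-- Re-run the defining query so newly added ids become available.
REFRESH MATERIALIZED VIEW id_values_randomized;
```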

伪装坚强ぢ

2020-11-22 07:24
Add a column called r with type serial. Index r.
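
A sketch of that setup step, assuming the table is named your_table as in the query below; the index name is illustrative. Adding a serial column to an existing table fills in values for the existing rows, which rewrites the table and can take a while on large tables.

```sql
-- Add an auto-incrementing column; existing rows get consecutive values.
ALTER TABLE your_table ADD COLUMN r serial;

-- Index it so "r > n ORDER BY r LIMIT 1" can use an index scan.
CREATE INDEX your_table_r_idx ON your_table (r);
```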

Assume we have 200,000 rows. We generate a random number n, where 0 < n <= 200,000.

Select rows with r > n, sort them ASC and select the smallest one.

Code:
```
select * from YOUR_TABLE 
where r > (
    select (
        select reltuples::bigint AS estimate
        from   pg_class
        where  oid = 'public.YOUR_TABLE'::regclass) * random()
    )
order by r asc limit(1);
```
The code is self-explanatory. The subquery in the middle quickly estimates the table's row count, as described in https://stackoverflow.com/a/7945274/1271094 .

At the application level, you need to execute the statement again if n is greater than the number of rows, or if you need to select multiple rows.

再見小時候

2020-11-22 07:27
Starting with PostgreSQL 9.5, there's a new syntax dedicated to getting random elements from a table:
```
SELECT * FROM mytable TABLESAMPLE SYSTEM (5);
```
This example will give you approximately 5% of the rows from mytable.

See the documentation for more explanation: http://www.postgresql.org/docs/current/static/sql-select.html
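
A brief sketch of the related options, assuming the same mytable. SYSTEM picks whole storage pages, so it is fast but the sampled rows come in clusters. BERNOULLI considers every row individually, which is slower but more evenly spread. In both cases the percentage is approximate. The seed value 42 and the LIMIT of 100 are arbitrary illustrative choices.

```sql
-- Row-level sampling: each row is included with roughly 5% probability.
SELECT * FROM mytable TABLESAMPLE BERNOULLI (5);

-- REPEATABLE fixes the seed, so the same sample is returned
-- as long as the table has not changed in the meantime.
SELECT * FROM mytable TABLESAMPLE SYSTEM (5) REPEATABLE (42);

-- To get at most 100 rows, oversample and then trim.
SELECT * FROM mytable TABLESAMPLE BERNOULLI (5)
ORDER BY random()
LIMIT 100;
```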

孤城傲影

2020-11-22 07:29
I know I'm a little late to the party, but I just found this awesome tool called pg_sample:

pg_sample - extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity.

I tried this on a database with 350M rows and it was really fast, though I don't know about the randomness.
```
./pg_sample --limit="small_table = *" --limit="large_table = 100000" -U postgres source_db | psql -U postgres target_db
```

走了就别回头了

2020-11-22 07:33

One lesson from my experience:

offset floor(random() * N) limit 1 is not faster than order by random() limit 1.

I thought the offset approach would be faster because it should save the time of sorting in Postgres. Turns out it wasn't.
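
One plausible explanation is that OFFSET still has to read and discard every skipped row, so both queries end up reading a large part of the table. A sketch for comparing the two on your own data, assuming a hypothetical table your_table with about 200,000 rows (substitute the real row count for N):

```sql
-- Skips a random number of rows, then returns the next one.
EXPLAIN ANALYZE
SELECT * FROM your_table
OFFSET floor(random() * 200000)::bigint
LIMIT 1;

-- Sorts by a random key and keeps only the top row.
EXPLAIN ANALYZE
SELECT * FROM your_table
ORDER BY random()
LIMIT 1;
```

Timings depend on table size, row width, and caching, so the result may differ from the experience reported above.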


伪装坚强ぢ

2020-11-22 07:34
PostgreSQL ORDER BY random(), selecting rows in random order:
```
select your_columns from your_table ORDER BY random()
```
PostgreSQL ORDER BY random() with DISTINCT:
```
select * from 
  (select distinct your_columns from your_table) table_alias
ORDER BY random()
```
PostgreSQL ORDER BY random(), limited to one row:
```
select your_columns from your_table ORDER BY random() limit 1
```