How to request a random row in SQL?

孤城傲影

How can I request a random row (or as close to truly random as is possible) in pure SQL?

29 answers

南旧

2020-11-21 07:17
You didn't say which server you're using. In older versions of SQL Server, you can use this:
```
select top 1 * from mytable order by newid()
```
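In SQL Server 2005 and up, you can use TABLESAMPLE to get a random sample; add a REPEATABLE (seed) clause if you need the same sample on every run. Sampling is done by page, so the number of rows returned is approximate: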
```
SELECT FirstName, LastName
FROM Contact
TABLESAMPLE (1 ROWS);
```
不要未来只要你来

2020-11-21 07:17
Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.

For MS SQL:

Minimum example:
```
select top 10 percent *
from table_name
order by rand(checksum(*))
```
Normalized execution time: 1.00

NewId() example:
```
select top 10 percent *
from table_name
order by newid()
```
Normalized execution time: 1.02

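NewId() is only marginally slower than rand(checksum(*)) (about 2% in this test). Both expressions are evaluated for every row and the result is then sorted, so you may not want to use either against large record sets.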

Selection with Initial Seed:
```
declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */

select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */
```
If you need to select the same set given a seed, this seems to work.
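To make the repeatability idea concrete outside SQL Server, here is a minimal Python sketch of the same pattern: sort the rows by a value derived only from the row and the seed, then keep the top 10 percent. This is a model, not T-SQL. `seeded_sample` is a hypothetical helper, and SHA-256 is swapped in for rand(checksum(*) % @seed) purely so the sort key is stable across runs.

```python
import hashlib
import math

def seeded_sample(rows, seed, percent=10):
    """Return the same pseudo-random `percent` of rows for a given seed."""
    def sort_key(row):
        # Stand-in for rand(checksum(*) % @seed): a number that depends
        # only on the row contents and the seed (assumption: SHA-256 is
        # used for illustration; SQL Server does not do this).
        digest = hashlib.sha256(f"{seed}:{row!r}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # TOP n PERCENT rounds the row count up.
    count = math.ceil(len(rows) * percent / 100)
    return sorted(rows, key=sort_key)[:count]

rows = [(i, f"name{i}") for i in range(1, 101)]
seed = 2020 * 11  # mirrors Year(getdate()) * month(getdate())
first = seeded_sample(rows, seed)
second = seeded_sample(rows, seed)
print(first == second, len(first))
```

Because the sort key depends only on the row and the seed, repeated calls with the same seed return the same rows in the same order.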
悲&欢浪女

2020-11-21 07:21
In MSSQL (tested on 11.0.5569) using
```
SELECT TOP 100 * FROM employee ORDER BY CRYPT_GEN_RANDOM(10)
```
is significantly faster than
```
SELECT TOP 100 * FROM employee ORDER BY NEWID()
```
天涯浪人

2020-11-21 07:23
I don't know how efficient this is, but I've used it before:
```
SELECT TOP 1 * FROM MyTable ORDER BY newid()
```
Because GUIDs are pretty random, the ordering means you get a random row.
耶瑟儿～

2020-11-21 07:23
In late, but got here via Google, so for the sake of posterity, I'll add an alternative solution.

Another approach is to use TOP twice, with alternating orders. I don't know if it is "pure SQL", because it uses a variable in the TOP, but it works in SQL Server 2008. Here's an example I use against a table of dictionary words, if I want a random word.
```
SELECT TOP 1
  word
FROM (
  SELECT TOP(@idx)
    word 
  FROM
    dbo.DictionaryAbridged WITH(NOLOCK)
  ORDER BY
    word DESC
) AS D
ORDER BY
  word ASC
```
Of course, @idx is a randomly generated integer between 1 and COUNT(*) on the target table, inclusive. If the column is indexed, the query benefits from that too. Another advantage is that you can use it in a function, where NEWID() is disallowed.

Lastly, the above query runs in about 1/10 of the exec time of a NEWID()-type of query on the same table. YMMV.
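For anyone unsure why two TOPs with opposite orderings land on a single row, here is a rough Python model of the query's logic (not T-SQL). The function name and the word list are made up for illustration.

```python
import random

def pick_via_double_top(words, idx):
    """Model of TOP 1 ... ASC over TOP(@idx) ... DESC."""
    # Inner query: TOP(@idx) word ... ORDER BY word DESC
    top_desc = sorted(words, reverse=True)[:idx]
    # Outer query: TOP 1 word ... ORDER BY word ASC
    return sorted(top_desc)[0]

words = ["apple", "banana", "cherry", "date", "elderberry"]
idx = random.randint(1, len(words))  # 1..COUNT(*), inclusive
picked = pick_via_double_top(words, idx)
print(idx, picked)
```

The inner step keeps the idx largest words and the outer step takes the smallest of those, so the result is the idx-th word counting from the end of the alphabet. A uniformly random idx therefore gives a uniformly random row.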
臣服心动

2020-11-21 07:27
For SQL Server

newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.

TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).
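The clumping can be pictured with a small Python simulation. This is only a sketch under simplifying assumptions: `page_sample` is a hypothetical helper, and the fixed `rows_per_page` is invented for illustration (real pages hold a variable number of rows).

```python
import random

def page_sample(rows, rows_per_page, fraction, rng):
    """Pick whole pages at random and return every row on each picked page."""
    sampled = []
    for start in range(0, len(rows), rows_per_page):
        page = rows[start:start + rows_per_page]
        # The keep/skip decision is made once per page, not once per row.
        if rng.random() < fraction:
            sampled.extend(page)
    return sampled

rows = list(range(10_000))
rng = random.Random(1)
sample = page_sample(rows, rows_per_page=100, fraction=0.01, rng=rng)
print(len(sample))
```

In this model the sample size is always a multiple of the page size and the rows arrive in contiguous runs, which is the clumping described above. It also shows why the row count varies from run to run.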

For a better-performing, truly random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Result Sets by Using TABLESAMPLE:
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
```
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
              / CAST (0x7fffffff AS int)
```
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
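The arithmetic in that expression can be checked with a short Python sketch. It is a model rather than T-SQL: `to_unit_interval` and `bernoulli_filter` are hypothetical names, and a random 32-bit signed integer stands in for CHECKSUM(NEWID(), SalesOrderID).

```python
import random

MASK = 0x7FFFFFFF  # 2147483647, the largest signed 32-bit integer

def to_unit_interval(checksum):
    # "& MASK" clears the sign bit, leaving a value in 0..MASK;
    # dividing by MASK scales it into the range 0.0 to 1.0.
    return (checksum & MASK) / MASK

def bernoulli_filter(rows, fraction, rng):
    """Keep each row independently with probability of about `fraction`."""
    kept = []
    for row in rows:
        # Assumption: a uniform 32-bit signed int models the checksum.
        checksum = rng.randint(-2**31, 2**31 - 1)
        if fraction >= to_unit_interval(checksum):
            kept.append(row)
    return kept

sample = bernoulli_filter(range(100_000), 0.01, random.Random(0))
print(len(sample))
```

Each row gets its own independent draw, so the result size hovers around one percent of the table instead of being exact. That matches the "(varies)" note next to the filter query's row count.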
When run against a table with 1,000,000 rows, here are my results:
```
SET STATISTICS TIME ON
SET STATISTICS IO ON

/* newid()
   rows returned: 10000
   logical reads: 3359
   CPU time: 3312 ms
   elapsed time: 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()

/* TABLESAMPLE
   rows returned: 9269 (varies)
   logical reads: 32
   CPU time: 0 ms
   elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)

/* Filter
   rows returned: 9994 (varies)
   logical reads: 3359
   CPU time: 641 ms
   elapsed time: 627 ms
*/    
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float) 
              / CAST (0x7fffffff AS int)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF
```
If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise, use the newid()/filter method. newid()/order by should be a last resort if you have a large result set.