Select n random rows from SQL Server table

前端未结


I've got a SQL Server table with about 50,000 rows in it. I want to select about 5,000 of those rows at random. I've thought of a complicated way, creating a temp table wi


16 answers

天涯浪人

2020-11-22 11:31

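select * from table where id in ( select id from table order by random() limit ((select count(*) from table)*55/100))

This selects 55 percent of the rows at random. Note that random() and limit are PostgreSQL/SQLite syntax; SQL Server has neither, and uses newid() and top instead.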
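For SQL Server itself, the same idea can be sketched with TOP and NEWID(). This is only an illustration: dbo.MyTable is a placeholder table name, not something given in the question.

```sql
-- Sketch: roughly 55 percent of the rows, in random order.
-- dbo.MyTable is a hypothetical table name.
SELECT TOP (55) PERCENT *
FROM dbo.MyTable
ORDER BY NEWID();   -- NEWID() produces a new GUID per row, so the sort order is random

-- Sketch: a fixed number of random rows, as the question asks (5,000 of about 50,000).
SELECT TOP (5000) *
FROM dbo.MyTable
ORDER BY NEWID();
```

Both statements sort the whole table, which is fine at 50,000 rows but gets expensive on much larger tables.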

傲寒

2020-11-22 11:32
Depending on your needs, TABLESAMPLE will give you a nearly random sample with much better performance. It is available in SQL Server 2005 and later.

TABLESAMPLE returns data from random pages rather than random rows, so it never even reads the data it will not return.

On a very large table I tested
```
select top 1 percent * from [tablename] order by newid()
```
took more than 20 minutes.
```
select * from [tablename] tablesample(1 percent)
```
took 2 minutes.

Performance will also improve on smaller samples in TABLESAMPLE whereas it will not with newid().

Please keep in mind that this is not as random as the newid() method but will give you a decent sampling.

See the MSDN page.
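Because TABLESAMPLE picks whole pages, the number of rows it returns is only approximate. One way to get an exact count, sketched here under assumptions, is to oversample by page and then trim with TOP. The table name dbo.MyTable, the 20 percent oversampling figure, and the seed 42 are all illustrative choices, not values from the question.

```sql
-- Sketch: page-level sample first, then trim to an exact row count.
-- Only the sampled rows are sorted, not the whole table.
SELECT TOP (5000) *
FROM dbo.MyTable TABLESAMPLE (20 PERCENT)
ORDER BY NEWID();

-- REPEATABLE returns the same sample for a given seed,
-- as long as the table's data has not changed in between.
SELECT *
FROM dbo.MyTable TABLESAMPLE (10 PERCENT) REPEATABLE (42);
```

If the page sample happens to hold fewer than 5,000 rows, the first query simply returns fewer rows, so the oversampling percentage should leave some margin.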
一整个雨季

2020-11-22 11:35

The server-side language in use (e.g. PHP, .NET) isn't specified, but if it's PHP, fetch the required number of records (or all of them) and shuffle them with PHP's shuffle function instead of randomising in the query. I don't know whether .NET has an equivalent function, but if it does, use that.

ORDER BY RAND() can carry quite a performance penalty, depending on how many records are involved.

花落未央

2020-11-22 11:36
Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.

For MS SQL:

Minimum example:
```
select top 10 percent *
from table_name
order by rand(checksum(*))
```
Normalized execution time: 1.00

NewId() example:
```
select top 10 percent *
from table_name
order by newid()
```
Normalized execution time: 1.02

NewId() is only marginally slower than rand(checksum(*)) (about 2 percent in this test), though even that difference may matter against very large record sets.

Selection with Initial Seed:
```
declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */

select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */
```
If you need to select the same set given a seed, this seems to work.
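Since checksum(*) covers every column, updating any column of a row can change where that row lands in the ordering. A possible alternative, offered only as a sketch, is to hash a stable key together with the seed. The table dbo.MyTable and its Id key column are hypothetical, and CONCAT and SHA2_256 need SQL Server 2012 or later.

```sql
-- Sketch: repeatable pseudo-random order derived from a seed and a key column.
-- dbo.MyTable and Id are hypothetical names.
DECLARE @seed int = 20201122;   -- any fixed value; the same seed gives the same sample

SELECT TOP (10) PERCENT *
FROM dbo.MyTable
ORDER BY HASHBYTES('SHA2_256', CONCAT(Id, ':', @seed));   -- hash of "Id:seed" per row
```

The selection stays the same for a given seed as long as the set of Id values does not change; inserting or deleting rows can still shift which rows fall in the top 10 percent.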

有刺的猬

2020-11-22 11:38

I was using ORDER BY NEWID() in a subquery, and it returned the same row for every row of the outer query:

    SELECT  ID,
            ( SELECT TOP 1
                        ImageURL
              FROM      SubTable
              ORDER BY  NEWID()
            ) AS ImageURL,
            GETUTCDATE(),
            1
    FROM    Mytable

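I solved it by referencing the parent table in the subquery's WHERE clause:

    SELECT  ID,
            ( SELECT TOP 1
                        ImageURL
              FROM      SubTable
              WHERE     Mytable.ID > 0
              ORDER BY  NEWID()
            ) AS ImageURL,
            GETUTCDATE(),
            1
    FROM    Mytable

Note the WHERE condition: it correlates the subquery with the outer row, so the subquery is re-evaluated for each row of Mytable.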


难免孤独

2020-11-22 11:39
This link has an interesting comparison between ORDER BY NEWID() and other methods for tables with 1, 7, and 13 million rows.

Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables.
```
SELECT TOP 10 PERCENT *
  FROM Table1
  ORDER BY NEWID()
```
However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:
1. The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.
2. In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command.
What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:
```
SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10
```
The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.

Please read the full article in MSDN.
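The percentage filter above returns only approximately 10 percent of the rows. RAND() without a seed is evaluated once per query, so the per-row variation in that query comes from BINARY_CHECKSUM(*) alone. A variant that mixes NEWID() into the checksum and then trims to an exact count could look like the sketch below. The names dbo.MyTable and Id and the 15 percent oversampling figure are assumptions for illustration, not part of the article excerpt.

```sql
-- Sketch: cheap per-row random filter, then trim to an exact row count.
-- dbo.MyTable and Id are hypothetical names.
DECLARE @wanted int = 5000;

SELECT TOP (@wanted) *
FROM dbo.MyTable
-- NEWID() differs per row, so each row gets its own value in 0..99.
-- The cast to bigint keeps ABS from overflowing on the smallest int value.
WHERE ABS(CAST(BINARY_CHECKSUM(Id, NEWID()) AS bigint)) % 100 < 15
ORDER BY NEWID();   -- only the roughly 15 percent that pass the filter are sorted
```

For a 50,000-row table the filter keeps about 7,500 rows, which leaves a comfortable margin above the 5,000 wanted while sorting far less than the full table.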
