Select n random rows from SQL Server table

陌清茗 2020-11-22 10:54

I've got a SQL Server table with about 50,000 rows in it. I want to select about 5,000 of those rows at random. I've thought of a complicated way, creating a temp table wi

16 Answers
  • 2020-11-22 11:42

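    This works for me (note that RANDOM() and LIMIT are SQLite/PostgreSQL syntax; SQL Server does not support them, and the T-SQL equivalent is TOP with ORDER BY NEWID()):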

    SELECT * FROM table_name
    ORDER BY RANDOM()
    LIMIT [number]
    
  • 2020-11-22 11:43
    select top 10 percent * from [yourtable] order by newid()
    

    In response to the "pure trash" comment concerning large tables: you could do it like this to improve performance.

    select  * from [yourtable] where [yourPk] in 
    (select top 10 percent [yourPk] from [yourtable] order by newid())
    

    The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.

  • 2020-11-22 11:43

    The MSDN article "Selecting Rows Randomly from a Large Table" has a simple, well-articulated solution that addresses the large-scale performance concerns.

      SELECT * FROM Table1
      WHERE (ABS(CAST(
      (BINARY_CHECKSUM(*) *
      RAND()) as int)) % 100) < 10
    
  • 2020-11-22 11:43

    Just order the table by a random number and obtain the first 5,000 rows using TOP.

    SELECT TOP 5000 * FROM [Table] ORDER BY newid();
    

    UPDATE

    Just tried it and a newid() call is sufficient - no need for all the casts and all the math.

  • 2020-11-22 11:43

    This is a combination of the initial seed idea and a checksum, which looks to me to give properly random results without the cost of NEWID():

    SELECT TOP [number] *
    FROM table_name
    ORDER BY RAND(CHECKSUM(*) * RAND())
    
  • 2020-11-22 11:44

    newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.

    TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).

    For a better-performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article "Limiting Result Sets by Using TABLESAMPLE":

    If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

    SELECT * FROM Sales.SalesOrderDetail
    WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
                  / CAST (0x7fffffff AS int)
    

    The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.
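
    To see the values this expression produces, a minimal sketch is to select it as a column. It assumes the same AdventureWorks Sales.SalesOrderDetail table used in the quoted article; the RandomValue alias and the TOP (5) limit are only there for illustration.

    ```sql
    -- Illustration only: show the per-row random value next to its key.
    -- CHECKSUM(NEWID(), SalesOrderID) yields a signed int per row;
    -- "& 0x7fffffff" clears the sign bit, and dividing by 2147483647
    -- (0x7fffffff as int) scales the result into the range 0 to 1.
    SELECT TOP (5)
           SalesOrderID,
           CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
           / CAST(0x7fffffff AS int) AS RandomValue  -- hypothetical alias
    FROM Sales.SalesOrderDetail;
    ```

    Each run should show different RandomValue figures, since NEWID() generates a fresh value for every row on every execution.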

    When run against a table with 1,000,000 rows, here are my results:

    SET STATISTICS TIME ON
    SET STATISTICS IO ON
    
    /* newid()
       rows returned: 10000
       logical reads: 3359
       CPU time: 3312 ms
       elapsed time: 3359 ms
    */
    SELECT TOP 1 PERCENT Number
    FROM Numbers
    ORDER BY newid()
    
    /* TABLESAMPLE
       rows returned: 9269 (varies)
       logical reads: 32
       CPU time: 0 ms
       elapsed time: 5 ms
    */
    SELECT Number
    FROM Numbers
    TABLESAMPLE (1 PERCENT)
    
    /* Filter
       rows returned: 9994 (varies)
       logical reads: 3359
       CPU time: 641 ms
       elapsed time: 627 ms
    */    
    SELECT Number
    FROM Numbers
    WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float) 
                  / CAST (0x7fffffff AS int)
    
    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF
    

    If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be a last resort if you have a large result set.
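
    For the original question (exactly 5,000 rows out of roughly 50,000), one possible sketch combines the filter with TOP: filter down to roughly 15 percent of the table first, then randomly order only the rows that survive. The table name dbo.MyTable and the key column Id are hypothetical, and the 0.15 threshold is an assumption chosen so that the filter almost always passes more than 5,000 rows.

    ```sql
    -- Hypothetical table dbo.MyTable with key column Id (about 50,000 rows).
    -- Step 1: the WHERE clause keeps each row with probability of about 0.15,
    --         so roughly 7,500 rows are expected to pass.
    -- Step 2: ORDER BY NEWID() shuffles only those rows, and TOP (5000)
    --         trims the result to the exact count wanted.
    SELECT TOP (5000) *
    FROM dbo.MyTable
    WHERE 0.15 >= CAST(CHECKSUM(NEWID(), Id) & 0x7fffffff AS float)
                  / CAST(0x7fffffff AS int)
    ORDER BY NEWID();
    ```

    The sort then touches about 7,500 rows rather than all 50,000. If the filter happened to pass fewer than 5,000 rows, the query would simply return fewer, so the threshold should leave a comfortable margin above the target fraction.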
