Using NEWID() with CTE to produce random subset of rows produces odd results

纵然是瞬间 提交于 2019-12-19 10:25:06

问题


I'm writing some SQL in a stored procedure to reduce a dataset to a limited random number of rows that I want to report on.

The report starts with a Group of Users and a filter is applied to specify the total number of random rows required (@SampleLimit).

To achieve the desired result, I start by creating a CTE (temp table) with:

  • The top(@SampleLimit) applied
  • group by UserId (as the UserID appears multiple times)
  • order by NEWID() to put the results in a random order

SQL:

; with cte_temp as 
       (select top(@SampleLimit) UserId from QueryResults 
        where (GroupId = @GroupId)
        group by UserId order by NEWID()) 

Once I have this result set, I then delete any results where the UserId is NOT IN the CTE created in the previous step.

delete QueryResults 
where (GroupId = @GroupId) and (UserId not in(select UserId from cte_temp))

The issue that I'm having is that from time to time, I get more results than specified in the @SampleLimit and other times it works exactly as expected.

I've tried breaking up the SQL and executing it outside the application and I cannot reproduce the issue.

Is there anything fundamentally wrong with what I am doing that could explain why I occasionally get more results that I request?

For completeness - my re-factored solution based on below answer:

select top(@SampleLimit) UserId into #T1
from  QueryResults
where (GroupId = @GroupId)
group by UserId
order by NEWID() 

delete QueryResults 
where (GroupId = @GroupId) and (UserId not in(select UserId from #T1))

回答1:


It is undeterministic how many times the SELECT statement involving NEWID() will be executed.

If you get a nested loops anti semi join between QueryResults and cte_temp and there is no spool in the plan it will likely be re-evaluated as many times as there are rows in QueryResults this means that for each outer row the set that is being compared against with NOT IN may be entirely different.

Instead of using a CTE you can materialize the results into a temporary table to avoid this.

INSERT INTO #T
SELECT TOP(@SampleLimit) UserId
FROM   QueryResults
WHERE  ( GroupId = @GroupId )
GROUP  BY UserId
ORDER  BY NEWID() 

Then reference that in the DELETE



来源:https://stackoverflow.com/questions/17110816/using-newid-with-cte-to-produce-random-subset-of-rows-produces-odd-results

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!