问题
I'm writing some SQL in a stored procedure to reduce a dataset to a limited random number of rows that I want to report on.
The report starts with a Group
of Users
and a filter is applied to specify the total number of random rows required (@SampleLimit
).
To achieve the desired result, I start by creating a CTE (temp table) with:
- The
top(@SampleLimit)
applied group by UserId
(as the UserID appears multiple times)order by NEWID()
to put the results in a random order
SQL:
; with cte_temp as
(select top(@SampleLimit) UserId from QueryResults
where (GroupId = @GroupId)
group by UserId order by NEWID())
Once I have this result set, I then delete any results where the UserId is NOT IN
the CTE created in the previous step.
delete QueryResults
where (GroupId = @GroupId) and (UserId not in(select UserId from cte_temp))
The issue that I'm having is that from time to time, I get more results than specified in the @SampleLimit
and other times it works exactly as expected.
I've tried breaking up the SQL and executing it outside the application and I cannot reproduce the issue.
Is there anything fundamentally wrong with what I am doing that could explain why I occasionally get more results that I request?
For completeness - my re-factored solution based on below answer:
select top(@SampleLimit) UserId into #T1
from QueryResults
where (GroupId = @GroupId)
group by UserId
order by NEWID()
delete QueryResults
where (GroupId = @GroupId) and (UserId not in(select UserId from #T1))
回答1:
It is undeterministic how many times the SELECT
statement involving NEWID()
will be executed.
If you get a nested loops anti semi join between QueryResults
and cte_temp
and there is no spool in the plan it will likely be re-evaluated as many times as there are rows in QueryResults
this means that for each outer row the set that is being compared against with NOT IN
may be entirely different.
Instead of using a CTE you can materialize the results into a temporary table to avoid this.
INSERT INTO #T
SELECT TOP(@SampleLimit) UserId
FROM QueryResults
WHERE ( GroupId = @GroupId )
GROUP BY UserId
ORDER BY NEWID()
Then reference that in the DELETE
来源:https://stackoverflow.com/questions/17110816/using-newid-with-cte-to-produce-random-subset-of-rows-produces-odd-results