Slow distinct query in SQL Server over large dataset

情深已故 2021-02-14 00:27

We're using SQL Server 2005 to track a fair amount of constantly incoming data (5-15 updates per second). We noticed, after it had been in production for a couple of months, that one of the tables had grown very large and that the DISTINCT query over its typeName column had become extremely slow.

10 Answers
  • 2021-02-14 00:50

    A looping approach should use multiple seeks (but loses some parallelism). It might be worth a try for cases with relatively few distinct values compared to the total number of rows (low cardinality).

    The idea came from this question:

    select typeName into #Result from Types where 1=0;
    
    declare @t varchar(100);
    set @t = (select min(typeName) from Types);  -- set separately; inline DECLARE initialization needs SQL Server 2008+
    
    while @t is not null
    begin
        insert into #Result values (@t);  -- insert the current value (including the minimum) before seeking the next one
        set @t = (select top 1 typeName from Types where typeName > @t order by typeName);
    end
    
    select * from #Result;
    

    It looks like there are also some other methods (notably a recursive CTE by Paul White):

    different-ways-to-find-distinct-values-faster-methods

    sqlservercentral Topic873124-338-5

  • 2021-02-14 00:50

    An indexed view can make this faster.

    create view dbo.alltypes
    with schemabinding as
    select typename, count_big(*) as kount
    from dbo.types
    group by typename;
    go
    
    -- the unique clustered index materializes the view
    create unique clustered index idx
    on dbo.alltypes (typename);
    

    The work to keep the view up to date on each change to the base table should be moderate (depending on your application, of course -- my point is that it doesn't have to scan the whole table each time or do anything insanely expensive like that.)
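
    A minimal sketch of how the view might then be queried, assuming the names above; on SQL Server 2005 editions other than Enterprise, the NOEXPAND hint is needed for the optimizer to read from the view's index rather than expanding the view:

    select typename
    from dbo.alltypes with (noexpand)  -- read the materialized view's clustered index directly
    order by typename;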

    Alternatively you could make a small table holding all values:

    select distinct typename
    into alltypes
    from types
    
    alter table alltypes
    add primary key (typename)
    
    alter table types add foreign key (typename) references alltypes
    

    The foreign key will make sure that all values used appear in the parent alltypes table. The trouble is in ensuring that alltypes does not contain values not used in the child types table.
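
    If you go this route, a minimal sketch of an insert pattern that keeps the parent table populated first (the @newTypeName variable and the column list are assumptions, not the asker's actual schema):

    -- add the type to the parent table first if it is not there yet
    insert into alltypes (typename)
    select @newTypeName
    where not exists (select 1 from alltypes where typename = @newTypeName);
    
    -- then insert into the big child table as usual
    insert into types (typename /* , other columns */)
    values (@newTypeName /* , other values */);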

  • 2021-02-14 00:56

    An index helps you quickly find a row. But you're asking the database to list all unique types for the entire table. An index can't help with that.

    You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID included in the nightly scan, and combine the results:

    select type
    from nightlyscan
    union
    select distinct type
    from verybigtable
    where rowid > lastscannedid
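
    A minimal sketch of what the nightly job itself might look like (nightlyscan, verybigtable, rowid, and a one-row scanstate table are assumed names, not from the question):

    declare @maxId bigint;
    set @maxId = (select max(rowid) from verybigtable);
    
    -- rebuild the snapshot of distinct types up to @maxId
    truncate table nightlyscan;
    insert into nightlyscan (type)
    select distinct type from verybigtable where rowid <= @maxId;
    
    -- remember how far the scan got, for the daytime UNION query
    update scanstate set lastscannedid = @maxId;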
    

    Another option is to normalize the big table into two tables:

    table1: id, guid, typeid
    type table: typeid, typename
    

    This would be very beneficial if the number of types was relatively small.
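
    A minimal sketch of that normalization (table names and column types here are assumptions):

    -- hypothetical lookup table for the distinct type names
    create table TypeLookup (
        typeid   int identity(1,1) primary key,
        typename varchar(100) not null unique
    );
    
    -- the big table stores only the small surrogate key
    create table BigTable (
        id     bigint not null primary key,
        guid   uniqueidentifier not null,
        typeid int not null references TypeLookup (typeid)
    );
    
    -- listing distinct types then reads the tiny lookup table instead of 200M rows
    select typename from TypeLookup;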

  • 2021-02-14 00:56

    My first thought is statistics. To find when they were last updated:

    SELECT
        name AS index_name, 
        STATS_DATE(object_id, index_id) AS statistics_update_date
    FROM
        sys.indexes 
    WHERE
        object_id = OBJECT_ID('MyTable');
    

    Edit: statistics are updated when indexes are rebuilt, which I see is not being done here.

    My second thought: is the index still there? The TOP 1 query should still use the index; I've just tested on one of my tables with 57 million rows and both queries use it.
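
    If the statistics do turn out to be stale, a minimal sketch of refreshing them manually (MyTable is the placeholder name from the query above):

    -- refresh the table's statistics with a full scan
    update statistics MyTable with fullscan;
    
    -- or rebuild all of its indexes, which also refreshes their statistics
    alter index all on MyTable rebuild;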

  • 2021-02-14 00:58

    There is an issue with the SQL Server optimizer when using the DISTINCT keyword. The solution was to force it to keep the same query plan by breaking the distinct query out separately.

    So we took queries such as:

    SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
    

    and break it up into the following

    SELECT typeName INTO #tempTable1 FROM types WITH (NOLOCK)
    SELECT DISTINCT typeName FROM #tempTable1
    

    Another way to get around it is to use a GROUP BY, which gets a different optimization plan.
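
    A minimal sketch of the GROUP BY form (same result set, but often a different plan):

    select typeName
    from types with (nolock)
    group by typeName;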

  • 2021-02-14 00:58

    As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.

    So it's really a matter of limiting the number of rows that need to be scanned.

    The question is: what do you need your DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) of distinct typenames?

    If so, you could have a separate table, DISTINCT_TYPENAMES, fill it initially with a single full table scan, and then, whenever new rows are inserted into the main table, check whether their typename is already in DISTINCT_TYPENAMES and add it if not.

    That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.
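
    A minimal sketch of how that check could be done with an insert trigger (the trigger name and DISTINCT_TYPENAMES schema are assumptions):

    create trigger trg_types_keep_distinct on types
    after insert
    as
    begin
        set nocount on;
    
        -- add any typenames from this insert that are not yet in the lookup table
        insert into DISTINCT_TYPENAMES (typename)
        select distinct i.typename
        from inserted i
        where not exists (
            select 1 from DISTINCT_TYPENAMES d where d.typename = i.typename
        );
    end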

    Marc
