Slow distinct query in SQL Server over large dataset

情深已故 2021-02-14 00:27

We're using SQL Server 2005 to track a fair amount of constantly incoming data (5-15 updates per second). We noticed after it has been in production for a couple months that on

10 Answers
  • 2021-02-14 00:50

    A looping approach should use multiple seeks (but loses some parallelism). It might be worth a try for cases with relatively few distinct values compared to the total number of rows (low cardinality).

    The idea came from this question:

    select typeName into #Result from Types where 1=0;   -- empty temp table with the right column definition
    
    declare @t varchar(100);
    set @t = (select min(typeName) from Types);           -- inline DECLARE initialization needs SQL Server 2008+, so assign separately on 2005
    
    while @t is not null
    begin
        insert into #Result values (@t);                  -- record the current value first, so the minimum is not skipped
        set @t = (select top 1 typeName from Types where typeName > @t order by typeName);
    end
    
    select * from #Result;
    
    

    It also looks like there are some other methods (notably the recursive CTE from Paul White); a sketch of that pattern follows the links below:

    different-ways-to-find-distinct-values-faster-methods

    sqlservercentral Topic873124-338-5
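
    A minimal sketch of that recursive-CTE technique, assuming an index on Types(typeName) and reusing the table/column names from the loop above (this is the general pattern, not the exact code from those threads):

    with DistinctTypes as
    (
        select min(typeName) as typeName from Types
        union all
        select (select min(t.typeName)
                from Types as t
                where t.typeName > d.typeName)
        from DistinctTypes as d
        where d.typeName is not null
    )
    select typeName
    from DistinctTypes
    where typeName is not null          -- drop the final NULL row that ends the recursion
    option (maxrecursion 0);            -- allow more than the default 100 recursion levels

    Like the loop, each recursion step is a single index seek for the next value, so it stays cheap when the number of distinct values is small.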

  • 2021-02-14 00:50

    An indexed view can make this faster.

    create view dbo.alltypes
    with schemabinding as
    select typename, count_big(*) as kount   -- count_big(*) is required when indexing a view with group by
    from dbo.types
    group by typename
    go                                       -- create view must be the only statement in its batch
    
    create unique clustered index idx
    on dbo.alltypes (typename)
    go
    

    The work to keep the view up to date on each change to the base table should be moderate (depending on your application, of course -- my point is that it doesn't have to scan the whole table each time or do anything insanely expensive like that.)
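
    To read from the view, something like this should do (a sketch; note that outside Enterprise Edition, SQL Server 2005 only uses an indexed view when you reference it with the NOEXPAND hint):

    select typename
    from dbo.alltypes with (noexpand);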

    Alternatively you could make a small table holding all values:

    select distinct typename
    into alltypes
    from types
    
    alter table alltypes
    add primary key (typename)
    
    alter table types add foreign key (typename) references alltypes
    

    The foreign key will make sure that all values used appear in the parent alltypes table. The trouble is in ensuring that alltypes does not contain values not used in the child types table.
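
    If stale entries do creep in, a periodic cleanup along these lines (a sketch using the names above) trims values that no longer appear in the child table:

    delete a
    from alltypes as a
    where not exists (select 1 from types as t
                      where t.typename = a.typename);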

  • 2021-02-14 00:56

    An index helps you quickly find a row. But you're asking the database to list all unique types for the entire table. An index can't help with that.

    You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID included in the nightly scan, and combine the results:

    select type
    from nightlyscan                    -- results of the nightly distinct scan
    union                               -- union (not union all) removes duplicates across the two sets
    select distinct type
    from verybigtable
    where rowid > lastscannedid         -- lastscannedid = highest rowid covered by the nightly job
    

    Another option is to normalize the big table into two tables:

    table1:     id, guid, typeid
    type table: typeid, typename
    

    This would be very beneficial if the number of types was relatively small.
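
    A rough sketch of that normalized layout (table and column names here are illustrative, not taken from the original schema):

    create table TypeLookup
    (
        typeId   int identity(1,1) primary key,
        typeName varchar(100) not null unique
    );

    create table BigTable
    (
        id     bigint identity(1,1) primary key,
        guid   uniqueidentifier not null,
        typeId int not null references TypeLookup (typeId)
    );

    -- listing the distinct type names becomes a scan of the tiny lookup table
    select typeName from TypeLookup;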

  • 2021-02-14 00:56

    My first thought is statistics. To find when they were last updated:

    SELECT
        name AS index_name, 
        STATS_DATE(object_id, index_id) AS statistics_update_date
    FROM
        sys.indexes 
    WHERE
        object_id = OBJECT_ID('MyTable');
    

    Edit: stats are updated when indexes are rebuilt, which I see is not being done here.

    My second thought: is the index still there? A TOP query should still use an index; I've just tested on one of my tables with 57 million rows and both queries use the index.
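
    If the statistics do turn out to be stale, refreshing them is a one-liner (substitute your real table name):

    update statistics MyTable with fullscan;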

  • 2021-02-14 00:58

    There is an issue with the SQL Server optimizer when using the DISTINCT keyword. Our solution was to force it to keep the same query plan by breaking the distinct query out separately.

    So we took queries such as:

    SELECT DISTINCT [typeName] FROM [types] WITH (NOLOCK);
    

    and broke them up into the following:

    SELECT typeName INTO #tempTable1 FROM types WITH (NOLOCK)
    SELECT DISTINCT typeName FROM #tempTable1
    

    Another way to get around it is to use a GROUP BY, which gets a different execution plan.
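
    For example, the GROUP BY form of the same query would look like this (whether it actually gets a better plan depends on your data and indexes):

    SELECT typeName FROM types WITH (NOLOCK) GROUP BY typeName;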

  • 2021-02-14 00:58

    As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.

    So it's really a matter of limiting the number of rows that need to be scanned.

    The question is: what do you need the DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) of distinct typenames?

    If so, you could keep a separate table, DISTINCT_TYPENAMES or something like it: fill it initially with one full table scan, and then, whenever new rows are inserted into the main table, check whether their typename is already in DISTINCT_TYPENAMES and add it if not.

    That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.
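
    A sketch of that insert-time check (DISTINCT_TYPENAMES follows the naming above; @newTypeName is a hypothetical variable holding the typename of the row being inserted):

    declare @newTypeName varchar(100);
    set @newTypeName = 'SomeType';      -- illustrative value; in practice this comes with each insert (or an insert trigger)

    if not exists (select 1 from DISTINCT_TYPENAMES where typeName = @newTypeName)
        insert into DISTINCT_TYPENAMES (typeName) values (@newTypeName);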

    Marc
