Get top 1 row of each group

余生分开走 2020-11-21 04:42

I have a table from which I want to get the latest entry for each group. Here's the table:

DocumentStatusLogs Table

|ID| DocumentID | Status         


        
20 answers
  •  粉色の甜心
    2020-11-21 05:11

    Verifying Clint's awesome and correct answer from above:

    The estimated cost split between the two queries below is interesting: Option 1 accounts for 52% of the batch and Option 2 for 48%, so using DISTINCT (Option 2) comes out about 4 percentage points cheaper than the ORDER BY approach (Option 1). ORDER BY, however, has the advantage that it can sort by multiple columns.

    IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END
    
    CREATE TABLE #DocumentStatusLogs (
        [ID] int NOT NULL,
        [DocumentID] int NOT NULL,
        [Status] varchar(20),
        [DateCreated] datetime
    )
    
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')
    

    Option 1:

    SELECT
        [Extent1].[ID],
        [Extent1].[DocumentID],
        [Extent1].[Status],
        [Extent1].[DateCreated]
    FROM #DocumentStatusLogs AS [Extent1]
        OUTER APPLY (
            SELECT TOP 1
                [Extent2].[ID],
                [Extent2].[DocumentID],
                [Extent2].[Status],
                [Extent2].[DateCreated]
            FROM #DocumentStatusLogs AS [Extent2]
            WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
            ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
        ) AS [Project2]
    WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])
    

    Option 2:

    SELECT 
        [Limit1].[ID] AS [ID], 
        [Limit1].[DocumentID] AS [DocumentID], 
        [Limit1].[Status] AS [Status], 
        [Limit1].[DateCreated] AS [DateCreated]
    FROM (
        SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
    ) AS [Distinct1]
        OUTER APPLY  (
            SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
            FROM (
                SELECT 
                    [Extent2].[ID] AS [ID], 
                    [Extent2].[DocumentID] AS [DocumentID], 
                    [Extent2].[Status] AS [Status], 
                    [Extent2].[DateCreated] AS [DateCreated]
                FROM #DocumentStatusLogs AS [Extent2]
                WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
            )  AS [Project2]
            ORDER BY [Project2].[ID] DESC
        ) AS [Limit1]
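
    The nested aliases make Option 2 fairly verbose. As a rough sketch only, the same DISTINCT + APPLY pattern can be written more compactly, this time ordering by two columns. The aliases docs, l and latest are made up for illustration. CROSS APPLY is used on the assumption that every DocumentID in the DISTINCT list has at least one row, which holds here because the list comes from the same table:

    ```sql
    -- Sketch: one row per DocumentID, picking the newest log entry.
    -- Assumes the #DocumentStatusLogs temp table created above.
    -- Aliases docs / l / latest are illustrative, not from the original queries.
    SELECT
        latest.[ID],
        latest.[DocumentID],
        latest.[Status],
        latest.[DateCreated]
    FROM (
        -- One row per group.
        SELECT DISTINCT [DocumentID]
        FROM #DocumentStatusLogs
    ) AS docs
        CROSS APPLY (
            -- "1 of many": the single newest row for this DocumentID,
            -- with ID as a tie-breaker when dates are equal.
            SELECT TOP (1)
                l.[ID],
                l.[DocumentID],
                l.[Status],
                l.[DateCreated]
            FROM #DocumentStatusLogs AS l
            WHERE l.[DocumentID] = docs.[DocumentID]
            ORDER BY l.[DateCreated] DESC, l.[ID] DESC
        ) AS latest;
    ```

    With the sample data above, this should pick the same three rows as Option 1, since both order by DateCreated and then ID.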
    

    In SQL Server Management Studio: after highlighting and running the first block (the table setup), highlight both Option 1 and Option 2, then right-click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.
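
    Keep in mind that the percentages in an estimated plan are the optimizer's cost estimates, not measured run times. As a minimal sketch of how one might cross-check them with actual numbers, the session-level SET STATISTICS options can be switched on around each query. On a seven-row temp table the figures will be tiny, so this is more telling against realistic data volumes:

    ```sql
    -- Sketch: report actual I/O and timing for each statement in this session.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Run Option 1 here, then Option 2, and compare the "logical reads"
    -- and "elapsed time" figures shown on the Messages tab.

    -- Switch the reporting off again when done.
    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;
    ```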

    Option 1 Results:

    ID  DocumentID  Status  DateCreated
    6   1           S1      8/2/11 3:00
    5   2           S3      8/1/11 6:00
    6   3           S1      8/2/11 7:00
    

    Option 2 Results:

    ID  DocumentID  Status  DateCreated
    6   1           S1      8/2/11 3:00
    5   2           S3      8/1/11 6:00
    6   3           S1      8/2/11 7:00
    

    Note:

    I tend to use APPLY when I want a join to be 1-to-(1 of many).

    I use a JOIN if I want the join to be 1-to-many, or many-to-many.

    I avoid a CTE with ROW_NUMBER() unless I need to do something advanced and am OK with the windowing performance penalty.

    I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!
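
    For comparison, the CTE with ROW_NUMBER() approach mentioned in the note can be sketched roughly as follows. The CTE name Ranked and the column alias rn are illustrative choices, not something from the original question:

    ```sql
    -- Sketch: number the rows within each DocumentID, newest first,
    -- then keep only row 1 of each group.
    -- Assumes the #DocumentStatusLogs temp table created above.
    WITH Ranked AS (
        SELECT
            [ID],
            [DocumentID],
            [Status],
            [DateCreated],
            ROW_NUMBER() OVER (
                PARTITION BY [DocumentID]
                ORDER BY [DateCreated] DESC, [ID] DESC
            ) AS [rn]
        FROM #DocumentStatusLogs
    )
    SELECT [ID], [DocumentID], [Status], [DateCreated]
    FROM Ranked
    WHERE [rn] = 1;
    ```

    Whether this is actually slower than the APPLY variants depends on the data and indexes, so it is worth comparing the plans rather than assuming.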
