Get top 1 row of each group

余生分开走 2020-11-21 04:42

I have a table from which I want to get the latest entry for each group. Here's the table:

DocumentStatusLogs Table

|ID| DocumentID | Status         


        
20 answers
  • 2020-11-21 05:11

    I know this is an old thread, but the TOP 1 WITH TIES solution is quite nice and might be helpful to some reading through the solutions.

    select top 1 with ties
       DocumentID
      ,Status
      ,DateCreated
    from DocumentStatusLogs
    order by row_number() over (partition by DocumentID order by DateCreated desc)
    

    More about the TOP clause can be found here.
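
How the query works may be easier to see step by step: ROW_NUMBER() assigns 1 to the newest row of every DocumentID, so after sorting by that number all the "1" rows tie for first place, and WITH TIES keeps every row that ties with the first one. The sketch below is only an illustration, not the answer's own code: it uses Python's sqlite3 module as a stand-in for SQL Server (SQLite has no TOP ... WITH TIES, so the tie-keeping step is emulated in Python), with made-up sample rows and ISO-formatted dates. It assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
# Hypothetical sample rows; dates are ISO strings so they sort correctly as text.
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (1, 1, "S1", "2011-07-29 01:00:00"),
        (2, 1, "S2", "2011-08-02 03:00:00"),
        (3, 2, "S1", "2011-07-28 04:00:00"),
        (4, 2, "S3", "2011-08-01 06:00:00"),
        (5, 3, "S1", "2011-08-02 07:00:00"),
    ],
)

# Step 1: number the rows of each document, newest first, and sort by that number.
RANKED = """
SELECT DocumentID, Status, DateCreated,
       ROW_NUMBER() OVER (
           PARTITION BY DocumentID ORDER BY DateCreated DESC
       ) AS rn
FROM DocumentStatusLogs
ORDER BY rn, DocumentID
"""
ranked = conn.execute(RANKED).fetchall()

# Step 2: emulate TOP 1 WITH TIES by keeping every row whose sort key (rn)
# equals the sort key of the very first row.
first_rn = ranked[0][3]
top_with_ties = [row[:3] for row in ranked if row[3] == first_rn]

for row in top_with_ties:
    print(row)
```

Because the sort key is the row number alone, exactly one row per DocumentID can tie for first place, even when two rows of a document share the same DateCreated.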

  • 2020-11-21 05:11

    Verifying Clint's awesome and correct answer from above:

    The performance comparison between the two queries below is interesting: in the estimated execution plan, the first query accounts for 52% of the batch cost and the second for 48%, a 4-point advantage for DISTINCT over ORDER BY. ORDER BY, however, has the advantage that it can sort by multiple columns.

    IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END
    
    CREATE TABLE #DocumentStatusLogs (
        [ID] int NOT NULL,
        [DocumentID] int NOT NULL,
        [Status] varchar(20),
        [DateCreated] datetime
    )
    
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')
    

    Option 1:

    SELECT
        [Extent1].[ID], 
        [Extent1].[DocumentID],
        [Extent1].[Status], 
        [Extent1].[DateCreated]
    FROM #DocumentStatusLogs AS [Extent1]
        OUTER APPLY (
            SELECT TOP 1
                [Extent2].[ID], 
                [Extent2].[DocumentID],
                [Extent2].[Status], 
                [Extent2].[DateCreated]
            FROM #DocumentStatusLogs AS [Extent2]
            WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
            ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
        ) AS [Project2]
    WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])
    

    Option 2:

    SELECT 
        [Limit1].[ID] AS [ID], 
        [Limit1].[DocumentID] AS [DocumentID], 
        [Limit1].[Status] AS [Status], 
        [Limit1].[DateCreated] AS [DateCreated]
    FROM (
        SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
    ) AS [Distinct1]
        OUTER APPLY  (
            SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
            FROM (
                SELECT 
                    [Extent2].[ID] AS [ID], 
                    [Extent2].[DocumentID] AS [DocumentID], 
                    [Extent2].[Status] AS [Status], 
                    [Extent2].[DateCreated] AS [DateCreated]
                FROM #DocumentStatusLogs AS [Extent2]
                WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
            )  AS [Project2]
            ORDER BY [Project2].[ID] DESC
        ) AS [Limit1]
    

    In SQL Server Management Studio: after highlighting and running the first block (the table setup), highlight both Option 1 and Option 2, then right-click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.

    Option 1 Results:

    ID  DocumentID  Status  DateCreated
    6   1   S1  8/2/11 3:00
    5   2   S3  8/1/11 6:00
    6   3   S1  8/2/11 7:00
    

    Option 2 Results:

    ID  DocumentID  Status  DateCreated
    6   1   S1  8/2/11 3:00
    5   2   S3  8/1/11 6:00
    6   3   S1  8/2/11 7:00
    

    Note:

    I tend to use APPLY when I want a join to be 1-to-(1 of many).

    I use a JOIN if I want the join to be 1-to-many, or many-to-many.

    I avoid CTE with ROW_NUMBER() unless I need to do something advanced and am ok with the windowing performance penalty.

    I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!

  • 2020-11-21 05:12
    SELECT o.*
    FROM `DocumentStatusLogs` o
      LEFT JOIN `DocumentStatusLogs` b
      ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
    WHERE b.DocumentID IS NULL;
    

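    This returns only the most recent row, by DateCreated, for each DocumentID: a row is kept only when no other row with the same DocumentID has a later DateCreated. If two rows of a document share the latest DateCreated, both are returned.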
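
The self-join above can be tried on a few rows to see which ones survive the `IS NULL` filter. This is a minimal sketch, assuming Python's sqlite3 module as a stand-in database, made-up sample rows, and ISO-formatted date strings (so that `<` compares them chronologically); the backticks are dropped because they are not needed there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
# Hypothetical sample rows; IDs 4 and 5 deliberately share a timestamp.
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (1, 1, "S1", "2011-07-29 01:00:00"),
        (2, 1, "S2", "2011-08-02 03:00:00"),
        (3, 2, "S1", "2011-07-28 04:00:00"),
        (4, 2, "S2", "2011-08-01 06:00:00"),
        (5, 2, "S3", "2011-08-01 06:00:00"),
    ],
)

# Each row o is paired with every later row b of the same document.
# Rows that find no later partner come back with b.* NULL: those are the latest.
LATEST = """
SELECT o.ID, o.DocumentID, o.Status
FROM DocumentStatusLogs AS o
LEFT JOIN DocumentStatusLogs AS b
  ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
WHERE b.DocumentID IS NULL
ORDER BY o.ID
"""
latest = conn.execute(LATEST).fetchall()
print(latest)
```

Document 2 comes back twice here, since neither of its two newest rows is strictly older than the other. If exactly one row per document is required, a tiebreaker (for example on ID) would have to be added to the join condition.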

  • 2020-11-21 05:13

    This is quite an old thread, but I thought I'd throw my two cents in just the same, as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan, it's obvious that the issue is the SORT operation it requires, which slows things down significantly.

    Here's an alternative that I lifted from Entity Framework that needs no SORT operation and does a non-clustered index search. This reduces the execution time to < 2 seconds on the aforementioned record set.

    SELECT 
    [Limit1].[DocumentID] AS [DocumentID], 
    [Limit1].[Status] AS [Status], 
    [Limit1].[DateCreated] AS [DateCreated]
    FROM   (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
    OUTER APPLY  (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
        FROM (SELECT 
            [Extent2].[ID] AS [ID], 
            [Extent2].[DocumentID] AS [DocumentID], 
            [Extent2].[Status] AS [Status], 
            [Extent2].[DateCreated] AS [DateCreated]
            FROM [dbo].[DocumentStatusLogs] AS [Extent2]
            WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
        )  AS [Project2]
        ORDER BY [Project2].[ID] DESC) AS [Limit1]
    

    Now I'm assuming something that isn't entirely specified in the original question: that your ID column is an auto-increment ID and that DateCreated is set to the current date on each insert. If your table is designed that way, then even without my query above you could get a sizable performance boost to gbn's solution (about half the execution time) just by ordering on ID instead of DateCreated, as this provides an identical sort order and is a faster sort.
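
Whether ordering by ID really matches ordering by DateCreated can be checked before relying on it. The following is a rough sketch rather than part of the original answer: it uses Python's sqlite3 module as a stand-in database, sample rows modeled on the thread's table with ISO-formatted dates, and a hypothetical helper `out_of_order_pairs` that counts pairs of rows in the same document where the higher ID carries the earlier date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
# Sample rows in which IDs and dates increase together within each document.
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (1, 2, "S1", "2011-07-28 04:00:00"),
        (2, 1, "S1", "2011-07-29 01:00:00"),
        (3, 1, "S2", "2011-07-30 02:00:00"),
        (4, 2, "S2", "2011-07-30 05:00:00"),
        (5, 2, "S3", "2011-08-01 06:00:00"),
        (6, 1, "S1", "2011-08-02 03:00:00"),
        (7, 3, "S1", "2011-08-02 07:00:00"),
    ],
)

# A pair (a, b) violates the assumption when b has the higher ID
# but the earlier DateCreated within the same document.
OUT_OF_ORDER = """
SELECT COUNT(*)
FROM DocumentStatusLogs AS a
JOIN DocumentStatusLogs AS b
  ON a.DocumentID = b.DocumentID
 AND a.ID < b.ID
 AND a.DateCreated > b.DateCreated
"""

def out_of_order_pairs():
    return conn.execute(OUT_OF_ORDER).fetchone()[0]

before = out_of_order_pairs()

# A backdated row (new ID, old date) breaks the assumption.
conn.execute(
    "INSERT INTO DocumentStatusLogs VALUES (8, 1, 'S3', '2011-07-01 00:00:00')"
)
after = out_of_order_pairs()
print(before, after)
```

A count of zero means the two orderings agree for the current data. The self-join compares every pair of rows within a document, so on a large table it is a one-off sanity check, not something to run routinely.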

  • 2020-11-21 05:14

    This solution can be used to get the TOP N most recent rows for each partition (in the example, N is 1 in the WHERE clause and the partition is doc_id):

    SELECT T.doc_id, T.status, T.date_created FROM 
    (
        SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
    ) T
    WHERE T.rnk = 1;
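
To illustrate the "TOP N" part, the comparison can be changed from `= 1` to `<= N`. The sketch below is an assumption-laden example, not the answer's own code: it runs the same query shape through Python's sqlite3 module (SQLite 3.25 or later for window functions), with made-up rows in a `doc` table and a hypothetical helper `top_n` that binds N as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc (doc_id INTEGER, status TEXT, date_created TEXT)")
# Hypothetical sample rows; ISO dates sort correctly as text.
conn.executemany(
    "INSERT INTO doc VALUES (?, ?, ?)",
    [
        (1, "S1", "2011-07-29"),
        (1, "S2", "2011-07-30"),
        (1, "S1", "2011-08-02"),
        (2, "S1", "2011-07-28"),
        (2, "S2", "2011-07-30"),
        (2, "S3", "2011-08-01"),
        (3, "S1", "2011-08-02"),
    ],
)

# Same shape as the query above, with "rnk = 1" generalized to "rnk <= N".
TOP_N = """
SELECT T.doc_id, T.status, T.date_created
FROM (
    SELECT a.*,
           ROW_NUMBER() OVER (
               PARTITION BY doc_id ORDER BY date_created DESC
           ) AS rnk
    FROM doc AS a
) AS T
WHERE T.rnk <= ?
ORDER BY T.doc_id, T.date_created DESC
"""

def top_n(n):
    """Return the n most recent rows of every doc_id."""
    return conn.execute(TOP_N, (n,)).fetchall()

top1 = top_n(1)
top2 = top_n(2)
print(top1)
print(top2)
```

A document with fewer than N rows (doc_id 3 here) simply contributes all the rows it has.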
    
  • 2020-11-21 05:15

    This is one of the most easily found questions on the topic, so I wanted to give a modern answer to it (both for my reference and to help others out). By using first_value and over, you can make short work of the above query:

    Select distinct DocumentID
      , first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
      , first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
    From DocumentStatusLogs
    

    This should work in SQL Server 2012 and up, the version in which FIRST_VALUE was introduced. FIRST_VALUE can be thought of as a way to accomplish SELECT TOP 1 when using an OVER clause. OVER allows grouping in the select list, so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion. Hope this helps.
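
The role of DISTINCT in this query may not be obvious: the window functions do not collapse rows, so every source row gets its document's latest values, and DISTINCT then folds the repeats into one row per DocumentID. Here is a small sketch of that behavior, assuming Python's sqlite3 module as a stand-in for SQL Server (SQLite 3.25 or later), made-up sample rows, and ISO-formatted dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (1, 1, "S1", "2011-07-29 01:00:00"),
        (2, 1, "S2", "2011-08-02 03:00:00"),
        (3, 2, "S1", "2011-07-28 04:00:00"),
        (4, 2, "S3", "2011-08-01 06:00:00"),
        (5, 3, "S1", "2011-08-02 07:00:00"),
    ],
)

LATEST = """
SELECT DISTINCT DocumentID,
       FIRST_VALUE(Status) OVER (
           PARTITION BY DocumentID ORDER BY DateCreated DESC
       ) AS Status,
       FIRST_VALUE(DateCreated) OVER (
           PARTITION BY DocumentID ORDER BY DateCreated DESC
       ) AS DateCreated
FROM DocumentStatusLogs
ORDER BY DocumentID
"""
latest = conn.execute(LATEST).fetchall()

# Without DISTINCT the same values repeat once per source row.
repeated = conn.execute(LATEST.replace("SELECT DISTINCT", "SELECT", 1)).fetchall()

print(latest)
print(len(repeated), "rows without DISTINCT")
```

One thing to keep in mind with this approach: each FIRST_VALUE call is evaluated independently, so if two rows of a document share the latest DateCreated, there is no guarantee the returned columns all come from the same row unless the ORDER BY includes a tiebreaker.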
