I have a table which I want to get the latest entry for each group. Here\'s the table:
DocumentStatusLogs
Table
|ID| DocumentID | Status
Verifying Clint's awesome and correct answer from above:
The performance between the two queries below is interesting. 52% being the top one. And 48% being the second one. A 4% improvement in performance using DISTINCT instead of ORDER BY. But ORDER BY has the advantage to sort by multiple columns.
IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END
CREATE TABLE #DocumentStatusLogs (
[ID] int NOT NULL,
[DocumentID] int NOT NULL,
[Status] varchar(20),
[DateCreated] datetime
)
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')
Option 1:
SELECT
[Extent1].[ID],
[Extent1].[DocumentID],
[Extent1].[Status],
[Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
OUTER APPLY (
SELECT TOP 1
[Extent2].[ID],
[Extent2].[DocumentID],
[Extent2].[Status],
[Extent2].[DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])
Option 2:
SELECT
[Limit1].[DocumentID] AS [ID],
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (
SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
OUTER APPLY (
SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (
SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
) AS [Project2]
ORDER BY [Project2].[ID] DESC
) AS [Limit1]
M$'s Management Studio: After highlighting and running the first block, highlight both Option 1 and Option 2, Right click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.
Option 1 Results:
ID DocumentID Status DateCreated
6 1 S1 8/2/11 3:00
5 2 S3 8/1/11 6:00
6 3 S1 8/2/11 7:00
Option 2 Results:
ID DocumentID Status DateCreated
6 1 S1 8/2/11 3:00
5 2 S3 8/1/11 6:00
6 3 S1 8/2/11 7:00
Note:
I tend to use APPLY when I want a join to be 1-to-(1 of many).
I use a JOIN if I want the join to be 1-to-many, or many-to-many.
I avoid CTE with ROW_NUMBER() unless I need to do something advanced and am ok with the windowing performance penalty.
I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!