I\'m looking for a schema-independent query. That is, if I have a users
table or a purchases
table, the query should be equally capable of catchin
I have done this using CTEs in SQL Server.
Here is a sample on how to delete dupes but you should be able to adapt it easily to find dupes:
WITH CTE (COl1, Col2, DuplicateCount)
AS
(
SELECT COl1,Col2,
ROW_NUMBER() OVER(PARTITION BY COl1,Col2 ORDER BY Col1) AS DuplicateCount
FROM DuplicateRcordTable
)
DELETE
FROM CTE
WHERE DuplicateCount > 1
GO
Here is a link to an article where I got the SQL:
http://blog.sqlauthority.com/2009/06/23/sql-server-2005-2008-delete-duplicate-rows/
I'm still confused about what "detecting them might be" but I'll give it a shot.
Excluding them is easy
e.g.
SELECT DISTINCT * FROM USERS
However if you wanted to only include them and a duplicate is all the fields than you have to do
SELECT
[Each and every field]
FROM
USERS
GROUP BY
[Each and every field]
HAVING COUNT(*) > 1
You can't get away with just using (*) because you can't GROUP BY *
so this requirement from your comments is difficult
a schema-independent means I don't want to specify all of the columns in the query
Unless that is you want to use dynamic SQL and read the columns from sys.columns
or information_schema.columns
For example
DECLARE @colunns nvarchar(max)
SET @colunns = ''
SELECT @colunns = @colunns + '[' + COLUMN_NAME +'], '
FROM INFORMATION_SCHEMA.columns
WHERE table_name = 'USERS'
SET @colunns = left(@colunns,len(@colunns ) - 1)
DECLARE @SQL nvarchar(max)
SET @SQL = 'SELECT ' + @colunns
+ 'FROM USERS' + 'GROUP BY '
+ @colunns
+ ' Having Count(*) > 1'
exec sp_executesql @SQL
Please note you should read this The Curse and Blessings of Dynamic SQL if you haven't already
I believe that this should work for you. Keep in mind that CHECKSUM() isn't 100% perfect - it's theoretically possible to get a false positive here (I think), but otherwise you can just change the table name and this should work:
;WITH cte AS (
SELECT
*,
CHECKSUM(*) AS chksum,
ROW_NUMBER() OVER(ORDER BY GETDATE()) AS row_num
FROM
My_Table
)
SELECT
*
FROM
CTE T1
INNER JOIN CTE T2 ON
T2.chksum = T1.chksum AND
T2.row_num <> T1.row_num
The ROW_NUMBER()
is needed so that you have some way of distinguishing rows. It requires an ORDER BY
and that can't be a constant, so GETDATE()
was my workaround for that.
Simply change the table name in the CTE and it should work without spelling out the columns.
I recently was looking into the same issue and noticed this question. I managed to solve it using a stored procedure with some dynamic SQL. This way you only need to specify the table name. And it will get all the other relevant data from sys tables.
/*
This SP returns all duplicate rows (1 line for each duplicate) for any given table.
to use the SP:
exec [database].[dbo].[sp_duplicates]
@table = '[database].[schema].[table]'
*/
create proc dbo.sp_duplicates @table nvarchar(50) as
declare @query nvarchar(max)
declare @groupby nvarchar(max)
set @groupby = stuff((select ',' + [name]
FROM sys.columns
WHERE object_id = OBJECT_ID(@table)
FOR xml path('')), 1, 1, '')
set @query = 'select *, count(*)
from '+@table+'
group by '+@groupby+'
having count(*) > 1'
exec (@query)