SQL query to find duplicate rows, in any table

I'm looking for a schema-independent query. That is, if I have a users table or a purchases table, the query should be equally capable of catchin

4 answers

独厮守ぢ

2021-01-18 18:33
I have done this using CTEs in SQL Server.

Here is a sample on how to delete dupes but you should be able to adapt it easily to find dupes:
```
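WITH CTE (Col1, Col2, DuplicateCount)
AS
(
    -- Number the rows within each group of identical (Col1, Col2) values
    SELECT Col1, Col2,
    ROW_NUMBER() OVER (PARTITION BY Col1, Col2 ORDER BY Col1) AS DuplicateCount
    FROM DuplicateRcordTable
)
-- Rows numbered 2 and above are the extra copies
DELETE
FROM CTE
WHERE DuplicateCount > 1
GO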
```
Here is a link to an article where I got the SQL:

http://blog.sqlauthority.com/2009/06/23/sql-server-2005-2008-delete-duplicate-rows/
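
To turn the delete into a find, the outer DELETE can become a SELECT. Here is a minimal sketch of that idea, assuming Python's built-in sqlite3 module as a stand-in for SQL Server (window functions need SQLite 3.25 or later). The two-column table layout and the sample rows are made up for illustration:

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DuplicateRcordTable (Col1 TEXT, Col2 TEXT)")
conn.executemany(
    "INSERT INTO DuplicateRcordTable VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("b", "y"), ("a", "x"), ("c", "z")],
)

# Same shape as the DELETE above, but the outer statement is a SELECT.
# Rows numbered 2 and up within a partition are the extra copies.
find_dupes = """
WITH CTE AS (
    SELECT Col1, Col2,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2 ORDER BY Col1) AS DuplicateCount
    FROM DuplicateRcordTable
)
SELECT Col1, Col2, DuplicateCount
FROM CTE
WHERE DuplicateCount > 1
ORDER BY Col1, Col2, DuplicateCount
"""
extra_copies = conn.execute(find_dupes).fetchall()
print(extra_copies)  # [('a', 'x', 2), ('a', 'x', 3)]
```

Note that this still names the columns in PARTITION BY, so on its own it is not schema-independent.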
执念已碎

2021-01-18 18:45
I'm still confused about what "detecting them might be" but I'll give it a shot.

Excluding them is easy, e.g.:
```
SELECT DISTINCT * FROM USERS
```
However, if you want to include only the duplicates, and a duplicate means a match on all the fields, then you have to do:
```
SELECT 
   [Each and every field]
FROM
   USERS
GROUP BY
   [Each and every field]
HAVING COUNT(*) > 1  
```
You can't get away with just using * because you can't GROUP BY *, so this requirement from your comments is difficult:

"schema-independent means I don't want to specify all of the columns in the query"

Unless, that is, you want to use dynamic SQL and read the columns from sys.columns or INFORMATION_SCHEMA.COLUMNS.

For example:
```
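DECLARE @columns nvarchar(max)
SET @columns = ''

-- Build a comma-separated list of the table's columns
SELECT @columns = @columns + QUOTENAME(COLUMN_NAME) + ', '
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'USERS'

-- Strip the trailing comma (LEN ignores the trailing space)
SET @columns = LEFT(@columns, LEN(@columns) - 1)

DECLARE @SQL nvarchar(max)
-- Each clause starts with a space so the keywords don't run together
SET @SQL = 'SELECT ' + @columns
         + ' FROM USERS'
         + ' GROUP BY ' + @columns
         + ' HAVING COUNT(*) > 1'

EXEC sp_executesql @SQL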
```
Please note that you should read "The Curse and Blessings of Dynamic SQL" if you haven't already.
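
The same idea (read the column list from the catalog, then build the GROUP BY from it) can be tried without a SQL Server instance. This is only a sketch, assuming Python's built-in sqlite3 module, where PRAGMA table_info plays the role of INFORMATION_SCHEMA.COLUMNS. The function name find_duplicates and the sample users table are made up for illustration:

```python
import sqlite3

def find_duplicates(conn, table):
    """Return (column values..., count) for every fully duplicated row in `table`."""
    # Quote identifiers by doubling embedded double quotes.
    quoted_table = '"' + table.replace('"', '""') + '"'
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    info = conn.execute(f"PRAGMA table_info({quoted_table})").fetchall()
    if not info:
        raise ValueError(f"unknown table: {table!r}")
    columns = ", ".join('"' + row[1].replace('"', '""') + '"' for row in info)
    sql = (
        f"SELECT {columns}, COUNT(*) FROM {quoted_table} "
        f"GROUP BY {columns} HAVING COUNT(*) > 1"
    )
    return conn.execute(sql).fetchall()

# Hypothetical data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        ("ann", "ann@example.com"),
        ("bob", "bob@example.com"),
        ("ann", "ann@example.com"),
    ],
)
duplicates = find_duplicates(conn, "users")
print(duplicates)  # [('ann', 'ann@example.com', 2)]
```

Only the table name is passed in, so the same call works for a purchases table or any other.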
刺人心

2021-01-18 18:46
I believe this should work for you. Keep in mind that CHECKSUM() isn't perfect: different rows can produce the same checksum, so false positives are possible. Otherwise, you can just change the table name and this should work:
```
;WITH cte AS (
    SELECT
        *,
        CHECKSUM(*) AS chksum,
        ROW_NUMBER() OVER(ORDER BY GETDATE()) AS row_num
    FROM
        My_Table
)
SELECT
    *
FROM
    CTE T1
INNER JOIN CTE T2 ON
    T2.chksum = T1.chksum AND
    T2.row_num <> T1.row_num
```
The ROW_NUMBER() is needed so that you have some way of distinguishing rows. It requires an ORDER BY and that can't be a constant, so GETDATE() was my workaround for that.

Simply change the table name in the CTE and it should work without spelling out the columns.
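
If the false positives are a concern, one option is to treat the checksum as a first filter only and then compare the candidate rows exactly. Below is a rough client-side sketch of that idea in Python, not the T-SQL query above: zlib.crc32 stands in for CHECKSUM(*), and the function name duplicate_groups and the sample data are assumptions for illustration:

```python
import sqlite3
import zlib
from collections import defaultdict

def duplicate_groups(rows):
    """Group identical rows, using a checksum only as a cheap first filter."""
    buckets = defaultdict(list)
    for row in rows:
        # crc32 is a stand-in for CHECKSUM(*); like it, it can collide.
        checksum = zlib.crc32(repr(row).encode("utf-8"))
        buckets[checksum].append(row)

    groups = []
    for candidates in buckets.values():
        if len(candidates) < 2:
            continue
        # Equal checksums only suggest equality, so compare the rows themselves.
        exact = defaultdict(int)
        for row in candidates:
            exact[row] += 1
        groups.extend((row, n) for row, n in exact.items() if n > 1)
    return groups

# Hypothetical table and data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE My_Table (id INTEGER, label TEXT)")
conn.executemany("INSERT INTO My_Table VALUES (?, ?)", [(1, "a"), (2, "b"), (1, "a")])
rows = conn.execute("SELECT * FROM My_Table").fetchall()
groups = duplicate_groups(rows)
print(groups)  # [((1, 'a'), 2)]
```

This pulls every row to the client, so it is meant to show the verification step rather than to be efficient on large tables.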
花落未央

2021-01-18 18:50

I was recently looking into the same issue and noticed this question. I managed to solve it using a stored procedure with some dynamic SQL. This way you only need to specify the table name, and the procedure gets all the other relevant data from the sys tables.

```
/*
This SP returns all duplicate rows (1 line for each duplicate) for any given table.

To use the SP:
exec [database].[dbo].[sp_duplicates]
    @table = '[database].[schema].[table]'
*/
create proc dbo.sp_duplicates @table nvarchar(50) as

declare @query nvarchar(max)
declare @groupby nvarchar(max)

-- Build a comma-separated, bracket-quoted list of every column in the table
set @groupby = stuff((select ',' + quotename([name])
                FROM sys.columns
                WHERE object_id = OBJECT_ID(@table)
                FOR xml path('')), 1, 1, '')

-- Group by every column; groups with more than one row are duplicates
set @query = 'select *, count(*) as duplicate_count
                from ' + @table + '
                group by ' + @groupby + '
                having count(*) > 1'

exec (@query)
```
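
Since the table name is concatenated straight into the query text, it may be worth checking it against the catalog first. Here is a small sketch of that check, assuming Python's built-in sqlite3 module rather than SQL Server (sqlite_master is SQLite's catalog table). The helper name resolve_table and the sample table are hypothetical:

```python
import sqlite3

def resolve_table(conn, table):
    """Return a safely quoted identifier, or raise if the table is unknown."""
    # The name is passed as a bound parameter here, never spliced into SQL.
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown table: {table!r}")
    # Quote the catalog's own spelling of the name, doubling embedded quotes.
    return '"' + row[0].replace('"', '""') + '"'

# Hypothetical table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (item TEXT, price REAL)")

quoted = resolve_table(conn, "purchases")
print(quoted)  # "purchases"

try:
    resolve_table(conn, "purchases; DROP TABLE purchases")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

Only a name that actually exists in the catalog ever reaches the dynamically built query.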
