SQL Query Find Exact and Near Dupes

Question

I have a SQL table with FirstName, LastName, Add1 and other fields. I am working to get this data cleaned up. There are a few kinds of likely dupes:

All 3 columns are exactly the same for more than 1 record
The First and Last are the same, but only 1 record has an address; the other is blank
The First and Last are similar (John | Doe vs John C. | Doe) and the address is the same or one is blank

I want to generate a query I can provide to the users, so they can check these records out, compare their related records and then delete the one they don't need.

I've been looking at similarity functions, soundex, and such, but it all seems so complicated. Is there an easy way to do this?

Thanks!

Edit:

So here is some sample data:

FirstName    |   LastName    |      Add1
John         |   Doe         |   1 Main St
John         |   Doe         |   
John A.      |   Doe         |   
Jane         |   Doe         |   2 Union Ave
Jane B.      |   Doe         |   2 Union Ave
Alex         |   Smith       |   3 Broad St
Chris        |   Anderson    |   4 South Blvd
Chris        |   Anderson    |   4 South Blvd

I really like Critical Error's query for identifying all different types of dupes. That would give me the above sample data, with the Alex Smith result not included, because there are no dupes for that.

What I want to do is take that result set and identify which are dupes for Jane Doe. She should only have 2 dupes. John Doe has 3, and Chris Anderson has 2. Can I get at that sub-result set?

Edit:

I figured it out! I will be marking Critical Error's answer as the solution, since it totally got me where I needed to go. Here is the solution, in case it might help others. Basically, this is what we are doing.

Selecting the records from the table where there are dupes
Adding a WHERE EXISTS sub-query to look in the same table for exact dupes, where the ID from the main query and sub-query do not match
Adding an OR EXISTS sub-query to look in the same table for similar dupes, using the DIFFERENCE score between the duplicated columns, where the ID from the main query and sub-query do not match
Adding an OR EXISTS sub-query to look in the same table for dupes on 2 fields where a 3rd may be null for one of the records, where the ID from the main query and sub-query do not match
Each sub-query is connected with an OR, so that any kind of duplicate is found
At the end of each sub-query, adding a requirement that either the main-query record or the sub-query record has the ID of the record you want to find duplicates for

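    DECLARE @CID AS INT;
    
    SET ANSI_NULLS ON;
    SET NOCOUNT ON;
    
    -- Id of the record whose duplicates should be listed.
    SET @CID = 12345;
    
    -- @Customers is the table variable from the answer; substitute your own table.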
    SELECT
        *
    FROM @Customers c
    WHERE
        -- Exact duplicates.
        EXISTS (
    
            SELECT * FROM @Customers x WHERE
                x.FirstName = c.FirstName 
                AND x.LastName = c.LastName 
                AND x.Add1 = c.Add1 
                AND x.Id <> c.Id
                AND (x.ID = @CID OR c.ID = @CID)
        
        )
        -- Match First/Last name are same/similar and the address is same.
        OR EXISTS (
        
            SELECT * FROM @Customers x WHERE
                DIFFERENCE( x.FirstName, c.FirstName ) = 4 
                AND DIFFERENCE( x.LastName, c.LastName ) = 4
                AND x.Add1 = c.Add1
                AND x.Id <> c.Id
                AND (x.ID = @CID OR c.ID = @CID)
    
        )
        -- Match First/Last name and one address exists.
        OR EXISTS (
        
            SELECT * FROM @Customers x WHERE
                x.FirstName = c.FirstName 
                AND x.LastName = c.LastName
                AND x.Id <> c.Id
                AND (
                    x.Add1 IS NULL AND c.Add1 IS NOT NULL
                    OR
                    x.Add1 IS NOT NULL AND c.Add1 IS NULL
                )
                AND (x.ID = @CID OR c.ID = @CID)
        );
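
The query above returns the matching rows, but not which row each one duplicates. As a rough sketch only (not part of the accepted answer), the same three rules can be moved into a self-join so that each suspected pair comes back on one row with a label. The table variable is loaded with the sample data from the question, with blank addresses assumed to be stored as NULL. The column aliases DupeId and MatchType are made up for illustration.

```sql
-- Sample data from the question; blank addresses are assumed to be NULL.
DECLARE @Customers table ( FirstName varchar(50), LastName varchar(50), Add1 varchar(50), Id int IDENTITY(1,1) );
INSERT INTO @Customers ( FirstName, LastName, Add1 ) VALUES
    ( 'John', 'Doe', '1 Main St' ),
    ( 'John', 'Doe', NULL ),
    ( 'John A.', 'Doe', NULL ),
    ( 'Jane', 'Doe', '2 Union Ave' ),
    ( 'Jane B.', 'Doe', '2 Union Ave' ),
    ( 'Alex', 'Smith', '3 Broad St' ),
    ( 'Chris', 'Anderson', '4 South Blvd' ),
    ( 'Chris', 'Anderson', '4 South Blvd' );

SELECT
    c.Id AS Id,
    x.Id AS DupeId,          -- hypothetical alias: the suspected duplicate
    CASE                     -- hypothetical label for the rule that matched
        WHEN x.FirstName = c.FirstName AND x.LastName = c.LastName AND x.Add1 = c.Add1
            THEN 'Exact'
        WHEN x.FirstName = c.FirstName AND x.LastName = c.LastName
            THEN 'Missing address'
        ELSE 'Similar name'
    END AS MatchType
FROM @Customers c
JOIN @Customers x
    ON x.Id > c.Id           -- list each pair once
    AND (
        -- Exact duplicates.
        ( x.FirstName = c.FirstName AND x.LastName = c.LastName AND x.Add1 = c.Add1 )
        -- First/Last name are same/similar and the address is same.
        OR ( DIFFERENCE( x.FirstName, c.FirstName ) = 4
             AND DIFFERENCE( x.LastName, c.LastName ) = 4
             AND x.Add1 = c.Add1 )
        -- First/Last name match and exactly one address exists.
        OR ( x.FirstName = c.FirstName AND x.LastName = c.LastName
             AND ( ( x.Add1 IS NULL AND c.Add1 IS NOT NULL )
                   OR ( x.Add1 IS NOT NULL AND c.Add1 IS NULL ) ) )
    )
ORDER BY c.Id, x.Id;
```

To narrow this to a single record, as in the query above, the same `( x.Id = @CID OR c.Id = @CID )` condition could be added to the join. If blank addresses are stored as empty strings rather than NULL, wrapping Add1 in `NULLIF( Add1, '' )` is one way to make the third rule apply.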

Answer 1:

Assuming each record has a unique Id, you can give this a try:

DECLARE @Customers table ( FirstName varchar(50), LastName varchar(50), Add1 varchar(50), Id int IDENTITY(1,1) );
INSERT INTO @Customers ( FirstName, LastName, Add1 ) VALUES
    ( 'John', 'Doe', '123 Anywhere Ln' ),
    ( 'John', 'Doe', '123 Anywhere Ln' ),
    ( 'John', 'Doe', NULL ),
    ( 'John C.', 'Doe', '123 Anywhere Ln' ),
    ( 'John C.', 'Doe', '15673 SW Liar Dr' );

SELECT
    *
FROM @Customers c
WHERE
    -- Exact duplicates.
    EXISTS (

        SELECT * FROM @Customers x WHERE
            x.FirstName = c.FirstName 
            AND x.LastName = c.LastName 
            AND x.Add1 = c.Add1 
            AND x.Id <> c.Id
    
    )
    -- Match First/Last name are same/similar and the address is same.
    OR EXISTS (
    
        SELECT * FROM @Customers x WHERE
            DIFFERENCE( x.FirstName, c.FirstName ) = 4 
            AND DIFFERENCE( x.LastName, c.LastName ) = 4
            AND x.Add1 = c.Add1
            AND x.Id <> c.Id

    )
    -- Match First/Last name and one address exists.
    OR EXISTS (
    
        SELECT * FROM @Customers x WHERE
            x.FirstName = c.FirstName 
            AND x.LastName = c.LastName
            AND x.Id <> c.Id
            AND (
                x.Add1 IS NULL AND c.Add1 IS NOT NULL
                OR
                x.Add1 IS NOT NULL AND c.Add1 IS NULL
            )
    );

Returns:

+-----------+----------+-----------------+----+
| FirstName | LastName |      Add1       | Id |
+-----------+----------+-----------------+----+
| John      | Doe      | 123 Anywhere Ln |  1 |
| John      | Doe      | 123 Anywhere Ln |  2 |
| John      | Doe      | NULL            |  3 |
| John C.   | Doe      | 123 Anywhere Ln |  4 |
+-----------+----------+-----------------+----+

Initial data set (all rows, before filtering):

+-----------+----------+------------------+----+
| FirstName | LastName |       Add1       | Id |
+-----------+----------+------------------+----+
| John      | Doe      | 123 Anywhere Ln  |  1 |
| John      | Doe      | 123 Anywhere Ln  |  2 |
| John      | Doe      | NULL             |  3 |
| John C.   | Doe      | 123 Anywhere Ln  |  4 |
| John C.   | Doe      | 15673 SW Liar Dr |  5 |
+-----------+----------+------------------+----+
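
For readers unfamiliar with the functions behind the "similar" rule, here is a small sketch of how SOUNDEX and DIFFERENCE relate. DIFFERENCE compares the SOUNDEX codes of its two arguments and returns an integer from 0 to 4, where 4 means the codes match. The names used here are illustrative inputs, not rows from the table above.

```sql
-- Illustrative inputs only; run against any SQL Server database.
SELECT
    SOUNDEX( 'John' )           AS JohnCode,    -- 'J500'
    SOUNDEX( 'Jon' )            AS JonCode,     -- 'J500'
    DIFFERENCE( 'John', 'Jon' ) AS SameCode,    -- 4, because the two codes are identical
    DIFFERENCE( 'John', 'Doe' ) AS OtherNames;  -- lower score for unrelated names
```

Requiring `= 4` on both name columns, as the query does, is the strictest setting. Relaxing it to `>= 3` would likely surface more candidates at the cost of more false matches, so it is worth checking against real data before handing the results to users.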

Source: https://stackoverflow.com/questions/64932557/sql-query-find-exact-and-near-dupes

Tags

sql

sql-server

duplicates

similarity

fuzzy-comparison