SQL: Find duplicates using several fields (no unique ID), workaround

Happy的楠姐 2021-01-24 04:12

I am trying to find duplicated vendors from a database using several fields from vendor table and vendor_address table. The thing is the more inner join I mak

2 Answers
  •  北荒 (original poster)
    2021-01-24 05:08

    Let's start with some interesting data that has chained duplicates on different attributes:

    CREATE TABLE data ( ID, A, B, C ) AS
      SELECT 1, 1, 1, 1 FROM DUAL UNION ALL -- Related to #2 on column A
      SELECT 2, 1, 2, 2 FROM DUAL UNION ALL -- Related to #1 on column A, #3 on B & C, #5 on C
      SELECT 3, 2, 2, 2 FROM DUAL UNION ALL -- Related to #2 on columns B & C, #5 on C
      SELECT 4, 3, 3, 3 FROM DUAL UNION ALL -- Related to #5 on column A
      SELECT 5, 3, 4, 2 FROM DUAL UNION ALL -- Related to #2 and #3 on column C, #4 on A
      SELECT 6, 5, 5, 5 FROM DUAL;          -- Unrelated
    

    Now, we can get some relationships using analytic functions (without any joins):

    SELECT d.*,
           LEAST(
             FIRST_VALUE( id ) OVER ( PARTITION BY a ORDER BY id ),
             FIRST_VALUE( id ) OVER ( PARTITION BY b ORDER BY id ),
             FIRST_VALUE( id ) OVER ( PARTITION BY c ORDER BY id )
           ) AS duplicate_of
    FROM   data d;
    

    Which gives:

    ID A B C DUPLICATE_OF
    -- - - - ------------
     1 1 1 1            1
     2 1 2 2            1
     3 2 2 2            2
     4 3 3 3            4
     5 3 4 2            2
     6 5 5 5            6
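
    As a cross-check outside the database, the same single-pass result can be reproduced with a minimal Python sketch. The names `rows` and `single_pass_duplicate_of` are hypothetical, introduced only for illustration, and the sketch assumes the sample data contains no NULLs (which SQL would compare differently).

```python
# Sample rows from the CREATE TABLE statement: id -> (a, b, c)
rows = {
    1: (1, 1, 1),
    2: (1, 2, 2),
    3: (2, 2, 2),
    4: (3, 3, 3),
    5: (3, 4, 2),
    6: (5, 5, 5),
}


def single_pass_duplicate_of(rows):
    """For each row, return the lowest id sharing a value in any one column."""
    n_cols = len(next(iter(rows.values())))
    # first_id[col][value] plays the role of
    # FIRST_VALUE(id) OVER (PARTITION BY col ORDER BY id)
    first_id = [{} for _ in range(n_cols)]
    for row_id in sorted(rows):
        for col, value in enumerate(rows[row_id]):
            first_id[col].setdefault(value, row_id)
    # min(...) across the per-column results plays the role of LEAST(...)
    return {
        row_id: min(first_id[col][value] for col, value in enumerate(values))
        for row_id, values in rows.items()
    }


print(single_pass_duplicate_of(rows))
```

    Row 4 still maps to itself here, which is the same gap the table above shows: one pass only sees direct relationships.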
    

    But that doesn't pick up that #4 is related to #5 which is related to #2 and then to #1...

    This can be found with a hierarchical query:

    SELECT id, a, b, c,
           CONNECT_BY_ROOT( id ) AS duplicate_of
    FROM   data
    CONNECT BY NOCYCLE ( PRIOR a = a OR PRIOR b = b OR PRIOR c = c );
    

    But that returns many, many duplicate rows: the query does not know where to start the hierarchy, so it will choose each row in turn as the root. Instead, you can use the first query to give the hierarchical query a starting point, namely the rows where the ID and DUPLICATE_OF values are the same:

    SELECT id, a, b, c,
           CONNECT_BY_ROOT( id ) AS duplicate_of
    FROM   (
      SELECT d.*,
             LEAST(
               FIRST_VALUE( id ) OVER ( PARTITION BY a ORDER BY id ),
               FIRST_VALUE( id ) OVER ( PARTITION BY b ORDER BY id ),
               FIRST_VALUE( id ) OVER ( PARTITION BY c ORDER BY id )
             ) AS duplicate_of
      FROM   data d
    )
    START WITH id = duplicate_of
    CONNECT BY NOCYCLE ( PRIOR a = a OR PRIOR b = b OR PRIOR c = c );
    

    Which gives:

    ID A B C DUPLICATE_OF
    -- - - - ------------
     1 1 1 1            1
     2 1 2 2            1
     3 2 2 2            1
     4 3 3 3            1
     5 3 4 2            1
     1 1 1 1            4
     2 1 2 2            4
     3 2 2 2            4
     4 3 3 3            4
     5 3 4 2            4
     6 5 5 5            6
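
    To see why roots 1 and 4 both end up covering rows 1 to 5, the traversal can be imitated with a small breadth-first search in Python. This is a sketch under assumptions, not a model of how Oracle executes the query: `related` and `reachable` are hypothetical helpers, the data is assumed to contain no NULLs, and each row is visited once per root rather than once per path.

```python
from collections import deque

# Sample rows from the CREATE TABLE statement: id -> (a, b, c)
rows = {
    1: (1, 1, 1),
    2: (1, 2, 2),
    3: (2, 2, 2),
    4: (3, 3, 3),
    5: (3, 4, 2),
    6: (5, 5, 5),
}


def related(x, y):
    """Mirror of the CONNECT BY condition: some column holds the same value."""
    return any(p == q for p, q in zip(x, y))


def reachable(rows, root):
    """Return the ids reachable from root by repeatedly following related() links."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for other, values in rows.items():
            if other not in seen and related(rows[current], values):
                seen.add(other)
                queue.append(other)
    return seen


# Starting points: rows whose single-pass DUPLICATE_OF equals their own ID
# (1, 4 and 6 in the first result table).
for root in (1, 4, 6):
    print(root, sorted(reachable(rows, root)))
```

    Roots 1 and 4 reach the same five rows, so each of those rows is reported under both roots, while root 6 reaches only itself.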
    

    Some rows are still duplicated because of the local minimum in the search that occurs at #4; these can be removed with a simple GROUP BY:

    SELECT id, a, b, c,
           MIN( duplicate_of ) AS duplicate_of
    FROM   (
      SELECT id, a, b, c,
             CONNECT_BY_ROOT( id ) AS duplicate_of
      FROM   (
        SELECT d.*,
               LEAST(
                 FIRST_VALUE( id ) OVER ( PARTITION BY a ORDER BY id ),
                 FIRST_VALUE( id ) OVER ( PARTITION BY b ORDER BY id ),
                 FIRST_VALUE( id ) OVER ( PARTITION BY c ORDER BY id )
               ) AS duplicate_of
        FROM   data d
      )
      START WITH id = duplicate_of
      CONNECT BY NOCYCLE ( PRIOR a = a OR PRIOR b = b OR PRIOR c = c )
    )
    GROUP BY id, a, b, c;
    

    Which gives the output:

    ID A B C DUPLICATE_OF
    -- - - - ------------
     1 1 1 1            1
     2 1 2 2            1
     3 2 2 2            1
     4 3 3 3            1
     5 3 4 2            1
     6 5 5 5            6
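
    The final grouping is the set of connected components of the "shares a value in some column" relation, labelled by the lowest ID in each component. One way to sanity-check the SQL output is a union-find sketch in Python. This swaps in a different technique from the hierarchical query, the name `group_duplicates` is hypothetical, and the data is again assumed to contain no NULLs.

```python
# Sample rows from the CREATE TABLE statement: id -> (a, b, c)
rows = {
    1: (1, 1, 1),
    2: (1, 2, 2),
    3: (2, 2, 2),
    4: (3, 3, 3),
    5: (3, 4, 2),
    6: (5, 5, 5),
}


def group_duplicates(rows):
    """Map each id to the lowest id in its chain of related rows."""
    parent = {row_id: row_id for row_id in rows}

    def find(x):
        # Walk up to the representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return
        # Keep the smaller id as representative so it matches MIN(duplicate_of).
        if rx < ry:
            parent[ry] = rx
        else:
            parent[rx] = ry

    first_seen = {}  # (column index, value) -> first id seen with that value
    for row_id in sorted(rows):
        for key in enumerate(rows[row_id]):
            if key in first_seen:
                union(row_id, first_seen[key])
            else:
                first_seen[key] = row_id

    return {row_id: find(row_id) for row_id in rows}


print(group_duplicates(rows))
```

    For the sample data this assigns rows 1 to 5 to group 1 and row 6 to group 6, matching the DUPLICATE_OF column above.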
    
