How to find duplicate records in PostgreSQL

长情又很酷 2021-01-29 17:17

I have a PostgreSQL database table called "user_links" which currently allows the following duplicate fields:

year, user_id, sid, cid

The uni

4 answers
  • 2021-01-29 17:43

    To keep things simple, I assume that you want a unique constraint on the year column only and that the primary key is a column named id.

    To find the duplicate values, run:

    SELECT year, COUNT(id)
    FROM YOUR_TABLE
    GROUP BY year
    HAVING COUNT(id) > 1
    ORDER BY COUNT(id);
    

    The statement above returns every year that occurs more than once in your table, together with its count. To delete all the duplicates except the latest entry (the row with the highest id), use the statement below.

    -- Self-join: A and B are two aliases of the same table.
    DELETE
    FROM YOUR_TABLE A USING YOUR_TABLE B
    WHERE A.year = B.year AND A.id < B.id;
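
    Applied to all four columns from the question, the same find-then-delete approach might look like the sketch below. The table name user_links_demo, the column types, the serial id primary key, and the constraint name are assumptions made for illustration; the real user_links table may differ.

    ```sql
    -- Hypothetical stand-in for user_links; column types and the id key are assumed.
    CREATE TEMP TABLE user_links_demo (
        id      serial PRIMARY KEY,
        year    int,
        user_id int,
        sid     int,
        cid     int
    );

    INSERT INTO user_links_demo (year, user_id, sid, cid) VALUES
        (2020, 1, 10, 100),
        (2020, 1, 10, 100),   -- duplicate of the first row
        (2020, 1, 10, 100),   -- duplicate of the first row
        (2021, 2, 20, 200);

    -- Step 1: list each duplicated key and how many rows share it.
    SELECT year, user_id, sid, cid, COUNT(*) AS copies
    FROM user_links_demo
    GROUP BY year, user_id, sid, cid
    HAVING COUNT(*) > 1;

    BEGIN;

    -- Step 2: delete every row that has a same-key row with a higher id,
    -- which leaves only the highest id per key.
    DELETE FROM user_links_demo a
    USING user_links_demo b
    WHERE a.year = b.year
      AND a.user_id = b.user_id
      AND a.sid = b.sid
      AND a.cid = b.cid
      AND a.id < b.id;

    -- Step 3: with the duplicates gone, the unique constraint can be added.
    ALTER TABLE user_links_demo
        ADD CONSTRAINT user_links_demo_key_uniq UNIQUE (year, user_id, sid, cid);

    COMMIT;
    ```

    Running the delete and the ALTER TABLE in one transaction means that a failure to create the constraint rolls the delete back as well. Rows with a NULL in any of the four columns are not matched by the equality comparisons, and a default unique constraint does not treat NULLs as equal, so such rows would need separate handling.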
    
  • 2021-01-29 17:44

    You can join the table to itself on the fields that would be duplicated and exclude each row's match with itself by requiring different id values. Select the id field from the first table alias (tn1) and apply the array_agg function to the id field of the second table alias (tn2). For array_agg to work properly, group the results by tn1.id. This produces a result set containing the id of each record and an array of all the ids that satisfy the join conditions.

    select tn1.id,
           array_agg(tn2.id) as duplicate_entries
    from table_name tn1 join table_name tn2 on 
        tn1.year = tn2.year 
        and tn1.sid = tn2.sid 
        and tn1.user_id = tn2.user_id 
        and tn1.cid = tn2.cid
        and tn1.id <> tn2.id
    group by tn1.id;
    

    Every id that appears in the duplicate_entries array for one id will also have its own row in the result set. You will have to use this result set to decide which id becomes the source of truth, that is, the one record that should not be deleted. Maybe you could do something like this:

    with dupe_set as (
    select tn1.id,
           array_agg(tn2.id) as duplicate_entries
    from table_name tn1 join table_name tn2 on 
        tn1.year = tn2.year 
        and tn1.sid = tn2.sid 
        and tn1.user_id = tn2.user_id 
        and tn1.cid = tn2.cid
        and tn1.id <> tn2.id
    group by tn1.id
    order by tn1.id asc)
    select ds.id from dupe_set ds where not exists 
     (select de from unnest(ds.duplicate_entries) as de where de < ds.id)
    

    This selects the lowest id in each group of duplicates (assuming the id is an increasing integer primary key). These are the ids that you would keep.
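
    The same keep/delete split can also be sketched without a self-join, by grouping once on the four columns. This is an alternative formulation, not the query from this answer. The inline sample rows are assumptions made for illustration; against the real table you would drop the WITH clause. The open-ended slice [2:] requires PostgreSQL 9.6 or later.

    ```sql
    -- Hypothetical sample rows standing in for the real table.
    WITH user_links (id, year, user_id, sid, cid) AS (
        VALUES (1, 2020, 1, 10, 100),
               (2, 2020, 1, 10, 100),   -- same key as id 1
               (3, 2020, 1, 10, 100),   -- same key as id 1
               (4, 2021, 2, 20, 200)    -- unique key, filtered out by HAVING
    )
    SELECT min(id) AS keep_id,
           -- All ids of the group in ascending order, minus the first (lowest) one.
           (array_agg(id ORDER BY id))[2:] AS duplicate_ids
    FROM user_links
    GROUP BY year, user_id, sid, cid
    HAVING count(*) > 1;
    ```

    With these sample rows the query returns one row: keep_id 1 and duplicate_ids {2,3}.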

  • 2021-01-29 17:52

    From "Find duplicate rows with PostgreSQL", here's a smart solution:

    select * from (
      SELECT id,
      ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row
      FROM tbl
    ) dups
    where 
    dups.Row > 1
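
    To act on that result, the numbered rows can feed a DELETE directly. The sketch below assumes a hypothetical table tbl with a serial id and two placeholder columns, mirroring the names used in the query; rn is just a shorter alias for the row number.

    ```sql
    -- Hypothetical table matching the placeholder names used in the query above.
    CREATE TEMP TABLE tbl (
        id      serial PRIMARY KEY,
        column1 int,
        column2 text
    );

    INSERT INTO tbl (column1, column2) VALUES
        (1, 'a'),
        (1, 'a'),   -- duplicate of id 1
        (2, 'b'),
        (1, 'a');   -- duplicate of id 1

    -- Rows numbered 2 and up within each (column1, column2) group are the extra copies.
    DELETE FROM tbl
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rn
            FROM tbl
        ) dups
        WHERE dups.rn > 1
    );

    -- Remaining rows: the lowest id of each group.
    SELECT * FROM tbl ORDER BY id;
    ```

    Because the window is ordered by id ascending, this keeps the oldest row of each group (ids 1 and 3 in the sample data); ordering by id DESC would keep the newest instead.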
    
  • 2021-01-29 18:09

    The basic idea is to use a nested query with a count aggregate:

    select * from yourTable ou
    where (select count(*) from yourTable inr
    where inr.sid = ou.sid) > 1
    

    You can adjust the where clause in the inner query to narrow the search.
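
    For instance, narrowing the inner WHERE clause to all four columns from the question might look like this sketch. The inline sample rows and the id column are assumptions made for illustration; against the real table you would drop the WITH clause.

    ```sql
    -- Hypothetical sample rows standing in for the real table.
    WITH user_links (id, year, user_id, sid, cid) AS (
        VALUES (1, 2020, 1, 10, 100),
               (2, 2020, 1, 10, 100),   -- same key as id 1
               (3, 2020, 1, 10, 100),   -- same key as id 1
               (4, 2021, 2, 20, 200),
               (5, 2021, 2, 20, 201)    -- differs in cid, so not a duplicate
    )
    SELECT ou.*
    FROM user_links ou
    WHERE (SELECT count(*)
           FROM user_links inr
           WHERE inr.year = ou.year
             AND inr.user_id = ou.user_id
             AND inr.sid = ou.sid
             AND inr.cid = ou.cid) > 1   -- more than one row shares this key
    ORDER BY ou.id;
    ```

    Unlike the GROUP BY form, this returns every full row that takes part in a duplicate (ids 1, 2, and 3 in the sample data), which is convenient when you want to inspect the rows before deciding which to delete.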


    Another good solution was mentioned in the comments (but not everyone reads them):

    select Column1, Column2, count(*)
    from yourTable
    group by Column1, Column2
    HAVING count(*) > 1
    

    Or shorter:

    SELECT (yourTable.*)::text, count(*)
    FROM yourTable
    GROUP BY yourTable.*
    HAVING count(*) > 1
    