Delete duplicate rows from table with no unique key

Backend · Open · 5 answers · 435 views
误落风尘 2021-01-13 09:46

How do I delete duplicate rows in a Postgres 9 table? The rows are complete duplicates on every field, and there is no individual field that could be used as a unique key.

5 answers
  • 2021-01-13 10:17

    You can try like this:

    CREATE TABLE temp AS
    SELECT DISTINCT * FROM discogs.releases_labels;

    DROP TABLE discogs.releases_labels;
    -- RENAME TO cannot move a table between schemas; use SET SCHEMA first
    ALTER TABLE temp SET SCHEMA discogs;
    ALTER TABLE discogs.temp RENAME TO releases_labels;
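    For completeness, the copy-distinct-and-swap pattern above can be sketched end to end. This is only an illustration using Python's built-in sqlite3 module, with SQLite standing in for Postgres and the table and column names (label, release_id, catno) assumed from the thread:

```python
import sqlite3

# Miniature of the question's table (names assumed from the thread);
# SQLite stands in for Postgres to show the copy-distinct-and-swap pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE releases_labels (label TEXT, release_id INTEGER, catno TEXT);
    INSERT INTO releases_labels VALUES
        ('Warp', 1, 'WAP100'),
        ('Warp', 1, 'WAP100'),      -- exact duplicate
        ('Ninja Tune', 2, 'ZEN50');

    -- copy distinct rows aside, drop the original, rename the copy back
    CREATE TABLE temp AS SELECT DISTINCT * FROM releases_labels;
    DROP TABLE releases_labels;
    ALTER TABLE temp RENAME TO releases_labels;
""")
print(conn.execute("SELECT COUNT(*) FROM releases_labels").fetchone()[0])  # 2
```

    Note that this rewrites the whole table and drops any indexes, constraints, and privileges along with it, which is why it only suits small or simple tables.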
    
  • 2021-01-13 10:22

    As you have no primary key, there is no easy way to distinguish one duplicated row from another. That's one of the reasons it is highly recommended that every table have a primary key (*).

    So you are left with only two solutions:

    • use a temporary table, as suggested by Rahul (IMHO the simpler and cleaner way) (**)
    • use procedural SQL and a cursor, either from a procedural language such as Python (or your preferred language) or with PL/pgSQL. Something like this (beware, untested):

      CREATE OR REPLACE FUNCTION deduplicate() RETURNS integer AS $$
      DECLARE
       -- NB: WHERE CURRENT OF requires a simply updatable cursor; the ORDER BY
       -- may prevent that unless it can be satisfied by an index
       curs CURSOR FOR SELECT * FROM releases_labels
                       ORDER BY label, release_id, catno;
       r   releases_labels%ROWTYPE;
       old releases_labels%ROWTYPE;
       n   integer := 0;
      BEGIN
       old := NULL;
       FOR rec IN curs LOOP          -- rec is implicitly declared by the loop
        r := rec;
        IF r = old THEN              -- whole-row comparison (ctid not included)
         DELETE FROM releases_labels WHERE CURRENT OF curs;
         n := n + 1;
        END IF;
        old := rec;
       END LOOP;
       RETURN n;
      END;
      $$ LANGUAGE plpgsql;
      
      SELECT deduplicate();
      

      This should delete the duplicate rows and return the number of rows actually deleted. It is not necessarily the most efficient way, but since you only touch the rows that need to be deleted, you will not have to lock the whole table.
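    The same scan-sorted-rows-and-delete-consecutive-duplicates idea can be sketched outside PL/pgSQL. Below is an illustrative Python version against SQLite, where rowid plays the role Postgres's ctid plays (table and column names assumed from the thread):

```python
import sqlite3

# Scan rows in sorted order and delete any row equal to the previous one.
# SQLite's rowid is used as the analog of Postgres's ctid (names assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE releases_labels (label TEXT, release_id INTEGER, catno TEXT);
    INSERT INTO releases_labels VALUES
        ('Warp', 1, 'WAP100'), ('Warp', 1, 'WAP100'),
        ('Warp', 1, 'WAP100'), ('Ninja Tune', 2, 'ZEN50');
""")

deleted = 0
previous = None
# fetchall() first, so we are not deleting out from under an open cursor
rows = conn.execute(
    "SELECT rowid, label, release_id, catno FROM releases_labels "
    "ORDER BY label, release_id, catno").fetchall()
for rowid, *fields in rows:
    if tuple(fields) == previous:   # same as the previous row: a duplicate
        conn.execute("DELETE FROM releases_labels WHERE rowid = ?", (rowid,))
        deleted += 1
    previous = tuple(fields)

print(deleted)  # 2 duplicates removed, 2 distinct rows survive
```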

    (*) Fortunately, PostgreSQL offers the ctid pseudo-column, which you can use as a key. If your table contains an oid column, you can also use that, as it will never change.

    (**) PostgreSQL's WITH (a data-modifying CTE) allows you to do that in a single SQL statement.

    These two points come from the answer by Nick Barnes.

  • 2021-01-13 10:25

    If you can afford to rewrite the whole table, this is probably the simplest approach:

    WITH Deleted AS (
      DELETE FROM discogs.releases_labels
      RETURNING *
    )
    INSERT INTO discogs.releases_labels
    SELECT DISTINCT * FROM Deleted;
    

    If you need to specifically target the duplicated records, you can make use of the internal ctid field, which uniquely identifies a row:

    DELETE FROM discogs.releases_labels
    WHERE ctid NOT IN (
      SELECT MIN(ctid)
      FROM discogs.releases_labels
      GROUP BY label, release_id, catno
    );
    

    Be very careful with ctid; it changes over time. But you can rely on it staying the same within the scope of a single statement.
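    As an illustration, the same keep-the-first-physical-row statement can be tried against SQLite, whose rowid is a rough analog of ctid (table and column names assumed from the thread):

```python
import sqlite3

# DELETE every row whose rowid is not the minimum of its duplicate group;
# rowid stands in for Postgres's ctid (names assumed from the thread).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE releases_labels (label TEXT, release_id INTEGER, catno TEXT);
    INSERT INTO releases_labels VALUES
        ('Warp', 1, 'WAP100'), ('Warp', 1, 'WAP100'), ('Ninja Tune', 2, 'ZEN50');
""")
conn.execute("""
    DELETE FROM releases_labels
    WHERE rowid NOT IN (SELECT MIN(rowid)
                        FROM releases_labels
                        GROUP BY label, release_id, catno)
""")
print(conn.execute("SELECT COUNT(*) FROM releases_labels").fetchone()[0])  # 2
```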

  • 2021-01-13 10:25

    Single SQL statement

    Here is a solution that deletes duplicates in place:

    DELETE FROM releases_labels r
    WHERE  EXISTS (
       SELECT 1
       FROM   releases_labels r1
       WHERE  r1 = r
       AND    r1.ctid < r.ctid
       );
    

    Since there is no unique key I am (ab)using the tuple ID ctid for the purpose. The physically first row survives in each set of dupes.

    Related:

    • In-order sequence generation
    • How do I (or can I) SELECT DISTINCT on multiple columns?

    ctid is a system column that is not part of the associated row type, so when referencing the whole row with table aliases in the expression r1 = r, only visible columns are compared (not the ctid or others). That's why the whole row can be equal and one ctid is still smaller than the other.

    With only a few duplicates, this is also the fastest of all solutions. With lots of duplicates, other solutions are faster.
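    In engines without Postgres's whole-row r1 = r comparison, the EXISTS form has to be spelled out column by column. A hedged SQLite sketch (rowid in place of ctid, names assumed from the thread; note the null-safe IS operator instead of = in case NULLs can occur):

```python
import sqlite3

# EXISTS-based in-place dedup, column by column; rowid stands in for ctid
# and IS gives null-safe equality (names assumed from the thread).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE releases_labels (label TEXT, release_id INTEGER, catno TEXT);
    INSERT INTO releases_labels VALUES
        ('Warp', 1, 'WAP100'), ('Warp', 1, 'WAP100'), ('Ninja Tune', 2, 'ZEN50');
""")
conn.execute("""
    DELETE FROM releases_labels
    WHERE EXISTS (SELECT 1
                  FROM releases_labels r1
                  WHERE r1.label      IS releases_labels.label
                  AND   r1.release_id IS releases_labels.release_id
                  AND   r1.catno      IS releases_labels.catno
                  AND   r1.rowid < releases_labels.rowid)
""")
print(conn.execute("SELECT COUNT(*) FROM releases_labels").fetchone()[0])  # 2
```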

    Then I suggest:

    ALTER TABLE discogs.releases_labels ADD COLUMN releases_labels_id serial PRIMARY KEY;
    

    Why does it work with NULL values?

    This is somewhat surprising. The reason is explained in the chapter Composite Type Comparison in the manual:

    The SQL specification requires row-wise comparison to return NULL if the result depends on comparing two NULL values or a NULL and a non-NULL. PostgreSQL does this only when comparing the results of two row constructors (as in Section 9.23.5) or comparing a row constructor to the output of a subquery (as in Section 9.22). In other contexts where two composite-type values are compared, two NULL field values are considered equal, and a NULL is considered larger than a non-NULL. This is necessary in order to have consistent sorting and indexing behavior for composite types.

    Bold emphasis mine.

    Alternatives with second table

    I removed that section, because the solution with a data-modifying CTE provided by @Nick is better.

  • 2021-01-13 10:43

    Since you also need to prevent duplicates in the future, you can add a surrogate key and a unique constraint while de-duping:


    -- add surrogate key
    ALTER TABLE releases_labels
            ADD COLUMN id SERIAL NOT NULL PRIMARY KEY
            ;
    
    -- verify
    SELECT * FROM releases_labels;
    
    DELETE FROM releases_labels dd
    WHERE EXISTS (SELECT *
            FROM releases_labels x
            WHERE x.label = dd.label
            AND x.release_id = dd.release_id
            AND x.catno = dd.catno
            AND x.id < dd.id
            );
    
    -- verify
    SELECT * FROM releases_labels;
    
    -- add unique constraint for the natural key
    ALTER TABLE releases_labels
            ADD UNIQUE (label,release_id,catno)
            ;
    
    -- verify
    SELECT * FROM releases_labels;
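    The end state of the steps above can be sketched as follows: after de-duping, a uniqueness guarantee over the natural key makes future duplicates fail. SQLite cannot add a PRIMARY KEY via ALTER TABLE, so a unique index stands in for the answer's ADD UNIQUE step (table and column names assumed from the thread):

```python
import sqlite3

# Dedup, then enforce uniqueness on the natural key so the problem cannot
# recur; a unique index approximates the ADD UNIQUE step (names assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE releases_labels (label TEXT, release_id INTEGER, catno TEXT);
    INSERT INTO releases_labels VALUES
        ('Warp', 1, 'WAP100'), ('Warp', 1, 'WAP100');

    DELETE FROM releases_labels
    WHERE rowid NOT IN (SELECT MIN(rowid)
                        FROM releases_labels
                        GROUP BY label, release_id, catno);

    CREATE UNIQUE INDEX releases_labels_natural_key
        ON releases_labels (label, release_id, catno);
""")
try:
    conn.execute("INSERT INTO releases_labels VALUES ('Warp', 1, 'WAP100')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True: the duplicate is now rejected
```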
    