Delete duplicate rows from table with no unique key

误落风尘 2021-01-13 09:46

How do I delete duplicate rows in a Postgres 9 table? The rows are complete duplicates on every field, and there is no individual field that could be used as a unique key, so

5 answers
  •  小蘑菇 (original poster)
    2021-01-13 10:25

    Single SQL statement

    Here is a solution that deletes duplicates in place:

    DELETE FROM releases_labels r
    WHERE  EXISTS (
       SELECT 1
       FROM   releases_labels r1
       WHERE  r1 = r
       AND    r1.ctid < r.ctid
       );
    

    Since there is no unique key I am (ab)using the tuple ID ctid for the purpose. The physically first row survives in each set of dupes.
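
    Before running the DELETE, it can help to preview how many rows it would remove. A minimal sketch that reuses the same predicate in a SELECT (the alias rows_to_delete is only illustrative):

    ```sql
    -- Dry run: count the rows the DELETE above would remove.
    -- A row qualifies if an identical row with a smaller ctid exists.
    SELECT count(*) AS rows_to_delete
    FROM   releases_labels r
    WHERE  EXISTS (
       SELECT 1
       FROM   releases_labels r1
       WHERE  r1 = r
       AND    r1.ctid < r.ctid
       );
    ```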

    • In-order sequence generation
    • How do I (or can I) SELECT DISTINCT on multiple columns?

    ctid is a system column that is not part of the associated row type, so when referencing the whole row with table aliases in the expression r1 = r, only visible columns are compared (not the ctid or others). That's why the whole row can be equal and one ctid is still smaller than the other.
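
    To see this in action, here is a small sketch on a throwaway table (dupe_demo and its columns are hypothetical, not part of the original schema):

    ```sql
    -- Hypothetical demo table with two sets of exact duplicates.
    CREATE TEMP TABLE dupe_demo (a int, b text);
    INSERT INTO dupe_demo VALUES
       (1, 'x'), (1, 'x'), (1, 'x'), (2, 'y'), (2, 'y'), (3, 'z');

    -- ctid is hidden from SELECT * and has to be listed explicitly.
    SELECT ctid, * FROM dupe_demo;

    -- d1 = d compares only a and b; ctid breaks the tie between identical rows.
    DELETE FROM dupe_demo d
    WHERE  EXISTS (
       SELECT 1
       FROM   dupe_demo d1
       WHERE  d1 = d
       AND    d1.ctid < d.ctid
       );

    -- One row per distinct (a, b) combination should remain.
    SELECT * FROM dupe_demo ORDER BY a;
    ```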

    With only a few duplicates, this is also the fastest of all solutions.
    With many duplicates, other solutions are faster.

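    Once the duplicates are gone, I suggest adding a surrogate primary key so that every row can be identified uniquely from then on: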

    ALTER TABLE discogs.releases_labels ADD COLUMN releases_labels_id serial PRIMARY KEY;
    

    Why does it work with NULL values?

    This is somewhat surprising. The reason is explained in the chapter Composite Type Comparison in the manual:

    The SQL specification requires row-wise comparison to return NULL if the result depends on comparing two NULL values or a NULL and a non-NULL. PostgreSQL does this only when comparing the results of two row constructors (as in Section 9.23.5) or comparing a row constructor to the output of a subquery (as in Section 9.22). In other contexts where two composite-type values are compared, two NULL field values are considered equal, and a NULL is considered larger than a non-NULL. This is necessary in order to have consistent sorting and indexing behavior for composite types.

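    The part that matters here is the second-to-last sentence: when two composite-type values are compared, two NULL field values are considered equal. The expression r1 = r compares composite-type values, so rows containing NULLs still match their duplicates.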
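
    The difference between the two cases can be checked directly. A small sketch, assuming a throwaway table null_demo (a hypothetical name):

    ```sql
    -- Row constructors follow the SQL specification: the result is NULL.
    SELECT ROW(1, NULL) = ROW(1, NULL) AS row_constructor_result;

    -- Composite-type values (here: whole-row references) treat
    -- two NULL fields as equal, so the comparison yields true.
    CREATE TEMP TABLE null_demo (a int, b text);
    INSERT INTO null_demo VALUES (1, NULL), (1, NULL);

    SELECT x = y AS composite_result
    FROM   null_demo x, null_demo y
    WHERE  x.ctid < y.ctid;   -- pairs the two physical rows exactly once
    ```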

    Alternatives with second table

    I removed that section, because the solution with a data-modifying CTE provided by @Nick is better.
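
    For reference, a data-modifying CTE for this task could look roughly like the sketch below. It is not necessarily the exact statement from @Nick's answer, and it assumes PostgreSQL 9.1 or later, where data-modifying statements in WITH are available:

    ```sql
    -- Remove all rows, then re-insert one copy of each distinct row,
    -- all within a single statement.
    WITH deleted AS (
       DELETE FROM releases_labels
       RETURNING *
       )
    INSERT INTO releases_labels
    SELECT DISTINCT * FROM deleted;
    ```

    This rewrites the whole table, so it pays off mainly when a large share of the rows are duplicates.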
