How to break ties when comparing columns in SQL

问题

I am trying to delete duplicates in Postgres. I am using this as the base of my query:

DELETE FROM case_file as p
WHERE EXISTS (
    SELECT FROM case_file as p1
    WHERE p1.serial_no = p.serial_no
    AND p1.cfh_status_dt < p.cfh_status_dt
    );

It works well, except that when the dates cfh_status_dt are equal then neither of the records are removed.

For rows that have the same serial_no and the date is the same, I would like to keep the one that has a registration_no (if any do, this column also has NULLS).

Is there a way I can do this with all one query, possibly with a case statement or another simple comparison?

回答1:

DELETE FROM case_file AS p
WHERE  id NOT IN (
   SELECT DISTINCT ON (serial_no) id  -- id = PK
   FROM   case_file 
   ORDER  BY serial_no, cfh_status_dt DESC, registration_no
   );

This keeps the (one) latest row per serial_no, choosing the smallest registration_no if there are multiple candidates.

NULL sorts last in default ascending order. So any row with a not-null registration_no is preferred.

If you want the greatest registration_no instead, to still sort NULL values last, use:

   ...
   ORDER  BY serial_no, cfh_status_dt DESC, registration_no DESC NULLS LAST

See:

Select first row in each GROUP BY group?
Sort by column ASC, but NULL values first?

If you have no PK (PRIMARY KEY) or other UNIQUE NOT NULL (combination of) column(s) you can use for this purpose, you can fall back to ctid. See:

How do I (or can I) SELECT DISTINCT on multiple columns?

NOT IN is typically not the most efficient way. But this deals with duplicates involving NULL values. See:

How to delete duplicate rows without unique identifier

If there are many duplicates - and you can afford to do so! - it can be (much) more efficient to create a new, pristine table of survivors and replace the old table, instead of deleting the majority of rows in the existing table.

Or create a temporary table of survivors, truncate the old and insert from the temp table. This way depending objects like views or FK constraints can stay in place. See:

How to delete duplicate entries?

Surviving rows are simply:

SELECT DISTINCT ON (serial_no) *
FROM   case_file 
ORDER  BY serial_no, cfh_status_dt DESC, registration_no;

来源：https://stackoverflow.com/questions/63005307/how-to-break-ties-when-comparing-columns-in-sql

标签

sql

postgresql

duplicates

greatest-n-per-group

sql-delete