Question
I am trying to delete duplicates in Postgres. I am using this as the base of my query:
DELETE FROM case_file as p
WHERE EXISTS (
SELECT FROM case_file as p1
WHERE p1.serial_no = p.serial_no
AND p1.cfh_status_dt < p.cfh_status_dt
);
It works well, except that when the cfh_status_dt dates are equal, neither record is removed.
For rows that have the same serial_no and the same date, I would like to keep the one that has a registration_no (if any do; this column also contains NULLs).
Is there a way to do this all in one query, possibly with a CASE expression or another simple comparison?
Answer 1:
DELETE FROM case_file AS p
WHERE id NOT IN (
SELECT DISTINCT ON (serial_no) id -- id = PK
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no
);
This keeps the (one) latest row per serial_no, choosing the smallest registration_no if there are multiple candidates.
NULL sorts last in default ascending order. So any row with a not-null registration_no is preferred.
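To make the sort order concrete, here is a small self-contained sketch that runs the same DISTINCT ON ordering against inline sample rows. The sample values and the integer and date column types are assumptions for illustration only; they are not taken from the question.

```sql
-- Hypothetical sample data; column types are assumed, not from the question.
WITH case_file (id, serial_no, cfh_status_dt, registration_no) AS (
   VALUES
     (1, 100, date '2020-01-01', NULL::int)  -- older row, loses on date
   , (2, 100, date '2020-02-01', NULL)       -- ties on date, NULL sorts last
   , (3, 100, date '2020-02-01', 555)        -- ties on date, not-null wins
   , (4, 200, date '2020-03-01', NULL)       -- only row for its serial_no
)
SELECT DISTINCT ON (serial_no) *
FROM   case_file
ORDER  BY serial_no, cfh_status_dt DESC, registration_no;
```

With these rows the query returns id 3 for serial_no 100 and id 4 for serial_no 200. Note that in Postgres NULL sorts first in descending order, so if cfh_status_dt can itself be NULL you would probably want cfh_status_dt DESC NULLS LAST.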
If you want the greatest registration_no instead, to still sort NULL values last, use:
...
ORDER BY serial_no, cfh_status_dt DESC, registration_no DESC NULLS LAST
See:
- Select first row in each GROUP BY group?
- Sort by column ASC, but NULL values first?
If you have no PK (PRIMARY KEY) or other UNIQUE NOT NULL (combination of) column(s) you can use for this purpose, you can fall back to ctid. See:
- How do I (or can I) SELECT DISTINCT on multiple columns?
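A minimal sketch of the ctid fallback, assuming the same table and columns as above. It is the same statement with the system column ctid standing in for the missing key:

```sql
-- Sketch only: ctid identifies the physical row version and is not stable
-- across updates or VACUUM FULL, so use it within a single statement.
DELETE FROM case_file
WHERE  ctid NOT IN (
   SELECT DISTINCT ON (serial_no) ctid
   FROM   case_file
   ORDER  BY serial_no, cfh_status_dt DESC, registration_no
   );
```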
NOT IN is typically not the most efficient way. But this deals with duplicates involving NULL values. See:
- How to delete duplicate rows without unique identifier
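One possible alternative to NOT IN, shown here as a sketch and not as part of the original answer, is to number the rows per serial_no with row_number() and delete everything except the first. It assumes id is a PRIMARY KEY (unique and not null) and uses the same sort order as above; whether it is actually faster on your data is worth checking with EXPLAIN.

```sql
-- Sketch: rn = 1 marks the survivor per serial_no; all other rows are deleted.
DELETE FROM case_file AS p
USING (
   SELECT id
        , row_number() OVER (PARTITION BY serial_no
                             ORDER BY cfh_status_dt DESC, registration_no) AS rn
   FROM   case_file
   ) AS d
WHERE  d.id = p.id
AND    d.rn > 1;
```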
If there are many duplicates - and you can afford to do so! - it can be (much) more efficient to create a new, pristine table of survivors and replace the old table, instead of deleting the majority of rows in the existing table.
Or create a temporary table of survivors, truncate the old and insert from the temp table. This way depending objects like views or FK constraints can stay in place. See:
- How to delete duplicate entries?
Surviving rows are simply:
SELECT DISTINCT ON (serial_no) *
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no;
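The temporary-table variant described above could be sketched as follows. The table name case_file_survivors is hypothetical, and the sketch assumes case_file has no generated columns and is not referenced by foreign keys from other tables, since TRUNCATE is refused in that case unless those tables are truncated together.

```sql
BEGIN;

-- Hypothetical temp table holding one surviving row per serial_no.
CREATE TEMP TABLE case_file_survivors ON COMMIT DROP AS
SELECT DISTINCT ON (serial_no) *
FROM   case_file
ORDER  BY serial_no, cfh_status_dt DESC, registration_no;

-- TRUNCATE takes an ACCESS EXCLUSIVE lock on case_file until commit.
TRUNCATE case_file;

INSERT INTO case_file
SELECT * FROM case_file_survivors;

COMMIT;
```

Running everything in one transaction means concurrent sessions never see the table empty, at the cost of blocking them until the commit.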
Source: https://stackoverflow.com/questions/63005307/how-to-break-ties-when-comparing-columns-in-sql