Question
I'm having trouble with slow delete queries. I have a schema, say "target", whose tables each have an equivalent table (identical columns and primary keys) in another schema, say "delta". I now want to delete all rows that appear in the delta schema from the target schema. I have tried the DELETE FROM ... WHERE EXISTS approach, but it is incredibly slow. Here's an example query:
DELETE FROM "target".name2phoneme
WHERE EXISTS (
    SELECT 1
    FROM delta.name2phoneme d
    WHERE name2phoneme.NAME_ID = d.NAME_ID
      AND name2phoneme.PHONEME_ID = d.PHONEME_ID
);
This is the layout of both tables (with the exception that the "delta" version has only the primary key and no foreign keys):
CREATE TABLE name2phoneme
(
    name_id uuid NOT NULL,
    phoneme_id uuid NOT NULL,
    seq_num numeric(3,0),
    CONSTRAINT pk_name2phoneme PRIMARY KEY (name_id, phoneme_id),
    CONSTRAINT fk_name2phoneme_name_id_2_name FOREIGN KEY (name_id)
        REFERENCES name (name_id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
        DEFERRABLE INITIALLY DEFERRED,
    CONSTRAINT fk_name2phoneme_phoneme_id_2_phoneme FOREIGN KEY (phoneme_id)
        REFERENCES phoneme (phoneme_id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
        DEFERRABLE INITIALLY DEFERRED
)
The "target" table originally contains a little over 18M rows, while the delta table contains about 3.7M rows (that are to be deleted from the target).
Here's the output of EXPLAIN of the above query:
Delete on name2phoneme  (cost=154858.03..1068580.46 rows=6449114 width=12)
  ->  Hash Join  (cost=154858.03..1068580.46 rows=6449114 width=12)
        Hash Cond: ((name2phoneme.name_id = d.name_id) AND (name2phoneme.phoneme_id = d.phoneme_id))
        ->  Seq Scan on name2phoneme  (cost=0.00..331148.16 rows=18062616 width=38)
        ->  Hash  (cost=69000.01..69000.01 rows=3763601 width=38)
              ->  Seq Scan on name2phoneme d  (cost=0.00..69000.01 rows=3763601 width=38)
I tried to EXPLAIN ANALYZE the above query, but execution took over two hours, so I killed it.
Any ideas on how I can optimize this operation?
Answer 1:
Deleting 3.7 million rows is very time-consuming because of the overhead of looking up each row, logging the change, and then removing it. Just thinking about all the dirty pages, logging, and cache misses is mind-boggling -- not to mention the updates to the indexes as well.
For that reason, something like this can be much faster:
create temporary table temp_n2p as
    select n2p.*
    from "target".name2phoneme n2p
    where not exists (
        select 1
        from delta.name2phoneme d
        where n2p.NAME_ID = d.NAME_ID
          and n2p.PHONEME_ID = d.PHONEME_ID
    );

truncate table "target".name2phoneme;

insert into "target".name2phoneme
select *
from temp_n2p;
You should also drop the indexes before the truncation and then recreate them afterwards.
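To make that concrete, here is a sketch of the drop/recreate step, using the constraint names from the CREATE TABLE in the question (only the primary key and two foreign keys exist there; adjust for any additional indexes your table carries):

```sql
-- Sketch only: constraint names taken from the question's CREATE TABLE.
alter table "target".name2phoneme drop constraint fk_name2phoneme_name_id_2_name;
alter table "target".name2phoneme drop constraint fk_name2phoneme_phoneme_id_2_phoneme;
alter table "target".name2phoneme drop constraint pk_name2phoneme;

-- ... truncate and re-insert from temp_n2p as shown above ...

-- Recreate the primary key first, then the foreign keys.
alter table "target".name2phoneme
    add constraint pk_name2phoneme primary key (name_id, phoneme_id);
alter table "target".name2phoneme
    add constraint fk_name2phoneme_name_id_2_name foreign key (name_id)
        references name (name_id) deferrable initially deferred;
alter table "target".name2phoneme
    add constraint fk_name2phoneme_phoneme_id_2_phoneme foreign key (phoneme_id)
        references phoneme (phoneme_id) deferrable initially deferred;
```

Building the primary key and foreign keys once over the final data is a single bulk sort/validate, which is typically far cheaper than maintaining them row by row during the 14M-row insert.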
Answer 2:
Have you tried either of these approaches?
DELETE FROM "target".name2phoneme t
USING delta.name2phoneme d
WHERE t.NAME_ID = d.NAME_ID
  AND t.PHONEME_ID = d.PHONEME_ID;
Or using WITH; note that Postgres materializes CTEs, so I'm not confident this is wise at the scale you need.
WITH cte AS (
    SELECT t.name_id, t.phoneme_id
    FROM "target".name2phoneme t
    INNER JOIN delta.name2phoneme d
        ON t.NAME_ID = d.NAME_ID
       AND t.PHONEME_ID = d.PHONEME_ID
)
DELETE FROM "target".name2phoneme t
USING cte d
WHERE t.NAME_ID = d.NAME_ID
  AND t.PHONEME_ID = d.PHONEME_ID;
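DELETE ... USING is PostgreSQL-specific syntax, but the semi-join it performs can be sanity-checked portably. A minimal sketch in Python's bundled sqlite3 (table names are illustrative; SQLite has no schemas or USING clause, so the schema qualifiers become name prefixes and a row-value IN over the composite key stands in for the join — this needs SQLite >= 3.15):

```python
import sqlite3

# In-memory stand-in for the two Postgres tables. SQLite has no schemas,
# so "target"/"delta" become table-name prefixes; types are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE target_name2phoneme (name_id, phoneme_id, seq_num,
        PRIMARY KEY (name_id, phoneme_id));
    CREATE TABLE delta_name2phoneme (name_id, phoneme_id, seq_num,
        PRIMARY KEY (name_id, phoneme_id));
""")

# 500 target rows; the delta holds the 150 of them with name_id < 30.
con.executemany("INSERT INTO target_name2phoneme VALUES (?, ?, ?)",
                [(n, p, 0) for n in range(100) for p in range(5)])
con.executemany("INSERT INTO delta_name2phoneme VALUES (?, ?, ?)",
                [(n, p, 0) for n in range(30) for p in range(5)])

# Semi-join delete on the composite primary key -- the same semantics
# as PostgreSQL's DELETE ... USING, written with a row-value IN.
con.execute("""
    DELETE FROM target_name2phoneme
    WHERE (name_id, phoneme_id) IN
          (SELECT name_id, phoneme_id FROM delta_name2phoneme)
""")

remaining = con.execute(
    "SELECT COUNT(*) FROM target_name2phoneme").fetchone()[0]
print(remaining)  # 500 - 150 = 350
```

The same pattern scales to the real tables; whether USING, a row-value IN, or EXISTS wins in Postgres depends on the plan, so comparing EXPLAIN output for each variant is the way to decide.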
Source: https://stackoverflow.com/questions/47402098/postgresql-slow-delete-from-where-exists