How can I speed up this PostgreSQL UPDATE FROM sql query? It currently takes days to finish running

Question

How can I speed up the PostgreSQL UPDATE FROM sql query below? It currently takes days to finish running.

UPDATE import_parts ip
SET part_part_id = pp.id
FROM parts.part_parts pp
WHERE pp.upc = ip.upc
AND (ip.status is null or ip.status != '6');

And why does it take days to run in the first place?

Most of the time I kill the query manually because it runs for more than 24 hours. The last time it finished successfully, it took almost 38 hours.

import_parts table has 971971 rows

parts.part_parts table has 2196357 rows

parts.part_parts table has an index on upc and id is the primary key of the table.

I already tried running VACUUM ANALYZE on the import_parts table and the parts.part_parts table before the update query above runs, but the query still takes too long, so I killed it manually after 30 minutes. I'm hoping to get the query to run in under 30 minutes.

Here's the result of EXPLAIN when I run the query after running VACUUM ANALYZE on import_parts table and parts.part_parts table:

UPDATE 1:

I also tried setting enable_nestloop to off: SET enable_nestloop TO off

But the query still takes too long to run so I manually killed it. Here's the result of EXPLAIN when enable_nestloop is turned off:

UPDATE 2:

Here's the result of EXPLAIN when using the query suggested by Abelisto in his answer to this post:

When I actually run the query though, I'm encountering this error:

ERROR: more than one row returned by a subquery used as an expression

I'm still figuring out how to fix the error.

Answer 1:

First of all, try rewriting your query like this:

UPDATE import_parts ip
SET part_part_id = (
  SELECT pp.id
  FROM parts.part_parts pp
  WHERE pp.upc = ip.upc)
WHERE status is null or status != '6';

Obviously, it raises an error like this:

ERROR:  more than one row returned by a subquery used as an expression

Fix it by adding conditions so that the subquery returns at most one row for each row in the target table.
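
One way such an additional condition might look is sketched below. It assumes that, when a upc has several matches, the lowest id is an acceptable pick (the question does not say which row should win). The EXISTS condition is also an addition: without it, rows with no matching upc would have part_part_id overwritten with NULL, which the original UPDATE ... FROM does not do.

```sql
UPDATE import_parts ip
SET part_part_id = (
  SELECT pp.id
  FROM parts.part_parts pp
  WHERE pp.upc = ip.upc
  ORDER BY pp.id   -- assumption: the lowest id is acceptable among duplicates
  LIMIT 1)         -- guarantees the subquery returns at most one row
WHERE (ip.status IS NULL OR ip.status <> '6')
  -- skip rows without a match so they are not set to NULL
  AND EXISTS (SELECT 1 FROM parts.part_parts pp WHERE pp.upc = ip.upc);
```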

Answer 2:

From what you say, it seems that upc is not unique in parts.part_parts. Try running this:

select upc, count(*)
from parts.part_parts pp
group by upc
having count(*) > 1;

These duplicates are probably causing the performance problems. You could get around this by arbitrarily choosing a value, such as:

UPDATE import_parts ip
  SET part_part_id = pp.id
  FROM (SELECT pp.upc, MIN(pp.id) as id
        FROM parts.part_parts pp
        GROUP BY pp.upc
       ) pp
  WHERE pp.upc = ip.upc AND (ip.status is null or ip.status <> '6');

Answer 3:

Create an index on import_parts with the columns upc and status.
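
A minimal sketch of that index follows. The index name is hypothetical; any unused name would do. Whether the planner actually uses the index for this update can be checked with EXPLAIN.

```sql
-- import_parts_upc_status_idx is an illustrative name, not from the question
CREATE INDEX import_parts_upc_status_idx
  ON import_parts (upc, status);

-- refresh planner statistics after creating the index
ANALYZE import_parts;
```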
I would recommend splitting the update into two statements.

I don't know your status values, but I suppose you have something like: null, 1, 2, 3, 4, 5, 6, 7

UPDATE import_parts ip
SET part_part_id = pp.id
FROM parts.part_parts pp
WHERE pp.upc = ip.upc
AND ip.status is null
;

UPDATE import_parts ip
SET part_part_id = pp.id
FROM parts.part_parts pp
WHERE pp.upc = ip.upc
AND ip.status IN ('1', '2', '3', '4', '5', '7')
;

Of course, you need to replace 1, 2, 3, 4, 5, 7 with your own values (everything other than 6).
I also like @Gordon Linoff's answer, but it depends on how many rows you have per upc.
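
If listing every status value is impractical, one possible variant of the second statement keeps the inequality from the question instead. In SQL, status <> '6' is never true when status is NULL, so this statement would cover only the non-null rows other than 6 and leave the NULL rows to the first statement. A sketch, assuming the same tables and columns as above:

```sql
-- Second statement without an explicit value list.
-- NULL <> '6' evaluates to NULL (not true), so NULL rows are not matched here.
UPDATE import_parts ip
SET part_part_id = pp.id
FROM parts.part_parts pp
WHERE pp.upc = ip.upc
AND ip.status <> '6'
;
```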

Source: https://stackoverflow.com/questions/62493528/how-can-i-speed-up-this-postgresql-update-from-sql-query-it-currently-takes-day

Tags

sql

postgresql

sqlperformance

postgresql-performance