Question
I have a string vector data containing items that I want to insert into a table named foos. It's possible that some of the elements in data already exist in the table, so I must watch out for those.
The solution I'm using starts by transforming the data vector into a virtual table old_and_new; it then builds a virtual table old containing the elements that are already present in foos; then it constructs a virtual table new with the elements that are really new. Finally, it inserts the new elements into table foos:
WITH old_and_new AS (SELECT unnest ($data :: text[]) AS foo),
old AS (SELECT foo FROM foos INNER JOIN old_and_new USING (foo)),
new AS (SELECT * FROM old_and_new EXCEPT SELECT * FROM old)
INSERT INTO foos (foo) SELECT foo FROM new
This works fine in a non-concurrent setting, but fails if concurrent threads try to insert the same new element at the same time. I know I can solve this by setting the isolation level to serializable, but that's very heavy-handed.
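For reference, the serializable variant I would rather avoid looks roughly like this (a sketch only; the application would still have to retry the whole transaction on serialization failures, SQLSTATE 40001):
BEGIN ISOLATION LEVEL SERIALIZABLE;
WITH old_and_new AS (SELECT unnest($data::text[]) AS foo),
     old AS (SELECT foo FROM foos INNER JOIN old_and_new USING (foo)),
     new AS (SELECT * FROM old_and_new EXCEPT SELECT * FROM old)
INSERT INTO foos (foo) SELECT foo FROM new;
COMMIT;  -- retry from BEGIN if this fails with serialization_failure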
Is there some other way I can solve this problem? If only there were a way to tell PostgreSQL that it was safe to ignore INSERT errors...
Answer 1:
Whatever your course of action is (@Denis gave you quite a few options), this rewritten INSERT command will be much faster:
INSERT INTO foos (foo)
SELECT n.foo
FROM unnest ($data::text[]) AS n(foo)
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL
It also leaves a much smaller time frame for a possible race condition.
In fact, the time frame should be so small that unique violations should only pop up under heavy concurrent load or with huge arrays.
Dupes in the array?
Except if your problem is built in: do you have duplicates in the input array itself? In that case, transaction isolation is not going to help you. The enemy is within!
Consider this example / solution:
INSERT INTO foos (foo)
SELECT n.foo
FROM (SELECT DISTINCT foo FROM unnest('{foo,bar,foo,baz}'::text[]) AS foo) n
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL
I use DISTINCT in the subquery to eliminate the "sleeper agents", a.k.a. duplicates. People tend to forget that the dupes may come from within the import data.
Full automation
This function is one way to deal with concurrency for good. If a UNIQUE_VIOLATION occurs, the INSERT is simply retried. The newly present rows are excluded from the new attempt automatically.
It does not take care of the opposite problem, that a row might have been deleted concurrently - such a row would not get re-inserted. One might argue that this outcome is OK, since the DELETE happened concurrently. If you want to prevent this, use SELECT ... FOR SHARE to protect the existing rows from concurrent DELETE.
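For illustration (this is separate from the function below), such a locking step could be run in the same transaction before the INSERT, roughly like this:
SELECT foo
FROM   foos
WHERE  foo = ANY(_data)  -- rows that already exist cannot be deleted until we commit
FOR    SHARE;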
CREATE OR REPLACE FUNCTION f_insert_array(_data text[], OUT ins_ct int) AS
$func$
BEGIN
   LOOP
      BEGIN
         INSERT INTO foos (foo)
         SELECT n.foo
         FROM  (SELECT DISTINCT foo FROM unnest(_data) AS foo) n
         LEFT  JOIN foos o USING (foo)
         WHERE o.foo IS NULL;

         GET DIAGNOSTICS ins_ct = ROW_COUNT;   -- number of rows actually inserted
         RETURN;

      EXCEPTION WHEN UNIQUE_VIOLATION THEN     -- foos.foo has a UNIQUE constraint
         RAISE NOTICE 'It actually happened!'; -- hardly ever happens
      END;
   END LOOP;
END
$func$
LANGUAGE plpgsql;
I made the function return the count of inserted rows, which is completely optional.
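A call could look like this (assuming foos already exists with a UNIQUE constraint on foo):
SELECT f_insert_array('{foo,bar,foo,baz}'::text[]);  -- returns the number of rows inserted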
-> SQLfiddle demo
Answer 2:
Is there some other way I can solve this problem?
There are plenty, but none are a panacea...
You can't lock for inserts like you can do a SELECT ... FOR UPDATE, since the rows don't exist yet.
You can lock the entire table, but that's even heavier-handed than serializing your transactions.
You can use advisory locks, but be super wary about deadlocks. Sort new keys so as to obtain the locks in a consistent, predictable order. (Someone more knowledgeable with PG's source code will hopefully chime in, but I'm guessing that the predicate locks used in the serializable isolation level amount to doing precisely that.)
In pure SQL you could also use a DO statement to loop through the rows one by one and trap the errors as they occur (a minimal sketch follows at the end of this answer):
- http://www.postgresql.org/docs/9.2/static/sql-do.html
- http://www.postgresql.org/docs/9.2/static/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING
Similarly, you could create a convoluted upsert function and call it once per piece of data...
If you're building $data at the app level, you could run the inserts one by one and ignore errors.
And I'm sure I forgot some additional options...
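For illustration only, here is a minimal sketch of that per-row approach. A literal array stands in for $data, and the table is assumed to be foos(foo) with a UNIQUE constraint:
DO
$$
DECLARE
   _foo text;
BEGIN
   FOREACH _foo IN ARRAY '{foo,bar,baz}'::text[]  -- stand-in for $data
   LOOP
      BEGIN
         INSERT INTO foos (foo) VALUES (_foo);
      EXCEPTION WHEN unique_violation THEN
         NULL;  -- the row already exists (or was just inserted concurrently): skip it
      END;
   END LOOP;
END
$$;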
Answer 3:
I like both Erwin's and Denis' answers, but another approach might be to have concurrent sessions perform the unnesting and load into a separate temporary table, optionally eliminating what duplicates they can against the target table. A single session would then select from this temporary table, resolve its internal duplicates in an appropriate manner, insert into the target table while checking again for existing values, and delete the selected temporary-table records (in the same query, using a common table expression).
This would be more batch-oriented, in the style of a data-warehouse extract-load-transform paradigm, but would guarantee that no unique-constraint issues would need to be dealt with.
Other advantages/disadvantages apply, such as decoupling the final insert from the data gathering (possible advantage), and needing to vacuum the temp table frequently (possible disadvantage), which may not be relevant to Jon's case, but might be useful info to others in the same situation.
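A rough sketch of that flow (all names here are placeholders; a plain staging table is used because PostgreSQL temporary tables are not visible across sessions):
-- Shared staging table, created once:
CREATE TABLE foos_staging (foo text);

-- Each loading session appends its unnested data:
INSERT INTO foos_staging (foo)
SELECT unnest($data::text[]);

-- A single consolidating session then moves the rows over in one statement,
-- deduplicating against both the staging data and the target table and
-- deleting the staging rows via a data-modifying CTE:
WITH moved AS (
   DELETE FROM foos_staging
   RETURNING foo
)
INSERT INTO foos (foo)
SELECT DISTINCT m.foo
FROM   moved m
LEFT   JOIN foos o USING (foo)
WHERE  o.foo IS NULL;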
Source: https://stackoverflow.com/questions/16359900/ignoring-errors-in-concurrent-insertions