Postgres UPDATE with ORDER BY, how to do it?

前端 未结 2 1286
青春惊慌失措
青春惊慌失措 2021-01-01 19:08

I need to do a Postgres update on a collection of records & I\'m trying to prevent a deadlock which appeared in the stress tests.

The typical resolution to this

相关标签:
2条回答
  • 2021-01-01 19:43

    In general, concurrency is difficult. Especially with 200 statements (i'm assuming you don't only query = SELECT) or even transactions (actually every single statement issued is wrapped into a transaction if it's not in a transaction already).

    The general solution concepts are (a combination of) these:

    1. To be aware that deadlocks can happen, catch them in the application, check the Error Codes for class 40 or 40P01 and retry the transaction.

    2. Reserve locks. Use SELECT ... FOR UPDATE. Evade explicit locks as long as possible. Locks will force other transactions to wait for lock release, which harms concurrency, but can prevent transactions running into deadlocks. Check the example for deadlocks in chapter 13. Especially the one in which transaction A waits for B and B waits for A (the bank account thingy).

    3. Choose a different Isolation Level, for example a weaker one like READ COMMITED, if possible. Be aware of LOST UPDATEs in READ COMMITED mode. Prevent them with REPEATABLE READ.

    Write your statements with locks in the same order in EVERY transaction, for example by table name alphabetically.

    LOCK / USE A  -- Transaction 1 
    LOCK / USE B  -- Transaction 1
    LOCK / USE C  -- Transaction 1
    -- D not used -- Transaction 1
    
    -- A not used -- Transaction 2
    LOCK / USE B  -- Transaction 2
    -- C not used -- Transaction 2
    LOCK / USE D  -- Transaction 2
    

    with the general locking order A B C D. This way, the transactions can interleave in any relative order and still have a good chance not to deadlock (depending on your statements you may have other serialization issues though). The statements of the transactions will run in the order specified by them, but it can be that transaction 1 runs their first 2, then xact 2 runs the first one, then 1 finishes and finally xact 2 finishes.

    Also, you should realise that a statement involving multiple rows is not executed atomically in a concurrent situation. In other words, if you have two statements A and B involving multiple rows, then they can be executed in this order:

    a1 b1 a2 a3 a4 b2 b3     
    

    but NOT as a block of a's followed by b's. The same applies to a statement with a sub-query. Have you looked at the query plans using EXPLAIN ?

    In your case, you can try

    UPDATE BALANCES WHERE ID IN (
     SELECT ID FROM some_function() FOR UPDATE  -- LOCK using FOR UPDATE 
     -- other transactions will WAIT / BLOCK temporarily on conc. write access
    );
    

    If possible by what you want to do, you can also use SELECT ... FOR UPDATE SKIP LOCK, which will skip already locked data to get back concurrency, which is lost by WAITing for another transaction to release a lock (FOR UPDATE). But this will not apply an UPDATE to locked rows, which your application logic might require. So run that later on (see point 1).

    Also read LOST UPDATE about the LOST UPDATE and SKIP LOCKED about SKIP LOCKED. A queue might be an idea in your case, which is explained perfectly in the SKIP LOCKED reference, although relational DBMS are not meant to be queues.

    HTH

    0 讨论(0)
  • 2021-01-01 19:52

    As far as I know, there's no way to accomplish this directly through the UPDATE statement; the only way to guarantee lock order is to explicitly acquire locks with a SELECT ... ORDER BY ID FOR UPDATE, e.g.:

    UPDATE Balances
    SET Balance = 0
    WHERE ID IN (
      SELECT ID FROM Balances
      WHERE ID IN (SELECT ID FROM some_function())
      ORDER BY ID
      FOR UPDATE
    )
    

    This has the downside of repeating the ID index lookup on the Balances table. In your simple example, you can avoid this overhead by fetching the physical row address (represented by the ctid system column) during the locking query, and using that to drive the UPDATE:

    UPDATE Balances
    SET Balance = 0
    WHERE ctid = ANY(ARRAY(
      SELECT ctid FROM Balances
      WHERE ID IN (SELECT ID FROM some_function())
      ORDER BY ID
      FOR UPDATE
    ))
    

    (Be careful when using ctids, as the values are transient. We're safe here, as the locks will block any changes.)

    Unfortunately, the planner will only utilise the ctid in a narrow set of cases (you can tell if it's working by looking for a "Tid Scan" node in the EXPLAIN output). To handle more complicated queries within a single UPDATE statement, e.g. if your new balance was being returned by some_function() alongside the ID, you'll need to fall back to the ID-based lookup:

    UPDATE Balances
    SET Balance = Locks.NewBalance
    FROM (
      SELECT Balances.ID, some_function.NewBalance
      FROM Balances
      JOIN some_function() ON some_function.ID = Balances.ID
      ORDER BY Balances.ID
      FOR UPDATE
    ) Locks
    WHERE Balances.ID = Locks.ID
    

    If the performance overhead is an issue, you'd need to resort to using a cursor, which would look something like this:

    DO $$
    DECLARE
      c CURSOR FOR
        SELECT Balances.ID, some_function.NewBalance
        FROM Balances
        JOIN some_function() ON some_function.ID = Balances.ID
        ORDER BY Balances.ID
        FOR UPDATE;
    BEGIN
      FOR row IN c LOOP
        UPDATE Balances
        SET Balance = row.NewBalance
        WHERE CURRENT OF c;
      END LOOP;
    END
    $$
    
    0 讨论(0)
提交回复
热议问题