How to UPSERT (MERGE, INSERT … ON DUPLICATE UPDATE) in PostgreSQL?

灰色年华 2020-11-21 05:42

A very frequently asked question here is how to do an upsert, which is what MySQL calls INSERT ... ON DUPLICATE UPDATE and the standard supports as part of the

6 answers
  • 2020-11-21 05:55

    SQLAlchemy upsert for Postgres >=9.5

    Since the long answer on this page covers many different SQL approaches for Postgres versions (not only non-9.5 as in the question), I would like to add how to do it in SQLAlchemy if you are using Postgres 9.5 or newer. Instead of implementing your own upsert, you can use SQLAlchemy's built-in functions (added in SQLAlchemy 1.1). I recommend these where possible, not only for convenience but also because they let PostgreSQL handle any race conditions that might occur.

    Cross-posting from another answer I gave yesterday (https://stackoverflow.com/a/44395983/2156909)

    SQLAlchemy supports ON CONFLICT now with two methods on_conflict_do_update() and on_conflict_do_nothing():

    Copying from the documentation:

    from sqlalchemy.dialects.postgresql import insert
    
    stmt = insert(my_table).values(user_email='a@b.com', data='inserted data')
    stmt = stmt.on_conflict_do_update(
        index_elements=[my_table.c.user_email],
        index_where=my_table.c.user_email.like('%@gmail.com'),
        set_=dict(data=stmt.excluded.data)
        )
    conn.execute(stmt)
    

    http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=conflict#insert-on-conflict-upsert

  • 2020-11-21 05:56

    Here is another solution to the single-row insertion problem for pre-9.5 versions of PostgreSQL. The idea is simply to attempt the insertion first and, if the record is already present, to update it instead:

    do $$
    begin 
      insert into testtable(id, somedata) values(2,'Joe');
    exception when unique_violation then
      update testtable set somedata = 'Joe' where id = 2;
    end $$;
    

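    Note that this solution is safe only if rows are never deleted from the table: if the row is deleted between the failed insert and the update, the update matches nothing and the data is silently lost.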

    I do not know about the efficiency of this solution, but it seems to me reasonable enough.

  • 2020-11-21 05:56

    Since this question was closed, I'm posting here how to do it using SQLAlchemy. It retries a bulk insert or update recursively to combat race conditions and validation errors.

    First the imports

    import itertools as it
    
    from functools import partial
    from operator import itemgetter
    
    from sqlalchemy.exc import IntegrityError
    from app import session
    from models import Posts
    

    Now a couple helper functions

    def chunk(content, chunksize=None):
        """Groups data into chunks each with (at most) `chunksize` items.
        https://stackoverflow.com/a/22919323/408556
        """
        if chunksize:
            i = iter(content)
            generator = (list(it.islice(i, chunksize)) for _ in it.count())
        else:
            generator = iter([content])
    
        return it.takewhile(bool, generator)
    
    
    def gen_resources(records):
        """Yields a dictionary if the record's id already exists, a row object 
        otherwise.
        """
        ids = {item[0] for item in session.query(Posts.id)}
    
        for record in records:
            is_row = hasattr(record, 'to_dict')
    
            if is_row and record.id in ids:
                # It's a row but the id already exists, so we need to convert it 
                # to a dict that updates the existing record. Since it is duplicate,
                # also yield True
                yield record.to_dict(), True
            elif is_row:
                # It's a row and the id doesn't exist, so no conversion needed. 
                # Since it's not a duplicate, also yield False
                yield record, False
            elif record['id'] in ids:
                # It's a dict and the id already exists, so no conversion needed. 
                # Since it is duplicate, also yield True
                yield record, True
            else:
                # It's a dict and the id doesn't exist, so we need to convert it. 
                # Since it's not a duplicate, also yield False
                yield Posts(**record), False
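
    To make the chunking behaviour concrete, here is a small standalone sketch that copies the chunk helper from above and runs it on made-up input (the ranges and sizes are illustrative only). It shows that a chunksize splits the data into lists of at most that many items, and that omitting it yields the whole input as a single chunk.

```python
import itertools as it


def chunk(content, chunksize=None):
    """Group data into chunks of at most `chunksize` items (copied from the answer)."""
    if chunksize:
        i = iter(content)
        # Keep slicing the same iterator; an empty list marks exhaustion
        generator = (list(it.islice(i, chunksize)) for _ in it.count())
    else:
        # No chunksize: the whole input is one chunk
        generator = iter([content])

    # Stop at the first empty (falsy) chunk
    return it.takewhile(bool, generator)


batches = list(chunk(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]

whole = list(chunk([1, 2, 3]))
print(whole)  # [[1, 2, 3]]
```

    This is why the upsert function further down can halve the chunk size on failure: each recursive call simply re-chunks the same items into smaller lists.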
    

    And finally the upsert function

    def upsert(data, chunksize=None):
        for records in chunk(data, chunksize):
            resources = gen_resources(records)
            sorted_resources = sorted(resources, key=itemgetter(1))
    
            for dupe, group in it.groupby(sorted_resources, itemgetter(1)):
                items = [g[0] for g in group]
    
                if dupe:
                    _upsert = partial(session.bulk_update_mappings, Posts)
                else:
                    _upsert = session.add_all
    
                try:
                    _upsert(items)
                    session.commit()
                except IntegrityError:
                    # A record was added or deleted after we checked, so retry
                    # 
                    # modify accordingly by adding additional exceptions, e.g.,
                    # except (IntegrityError, ValidationError, ValueError)
                    session.rollback()
                    upsert(items)
                except Exception:
                    # Some other error occurred so reduce chunksize to isolate the 
                    # offending row(s)
                    session.rollback()
                    num_items = len(items)
    
                    if num_items > 1:
                        upsert(items, num_items // 2)
                    else:
                        print('Error adding record {}'.format(items[0]))
    

    Here's how you use it

    >>> data = [
    ...     {'id': 1, 'text': 'updated post1'}, 
    ...     {'id': 5, 'text': 'updated post5'}, 
    ...     {'id': 1000, 'text': 'new post1000'}]
    ... 
    >>> upsert(data)
    

    The advantage this has over bulk_save_objects is that it can handle relationships, error checking, etc on insert (unlike bulk operations).

  • WITH UPD AS (UPDATE TEST_TABLE SET SOME_DATA = 'Joe' WHERE ID = 2
    RETURNING ID),
    INS AS (SELECT 2, 'Joe' WHERE NOT EXISTS (SELECT * FROM UPD))
    INSERT INTO TEST_TABLE(ID, SOME_DATA) SELECT * FROM INS;
    

    Tested on PostgreSQL 9.3.

  • 2020-11-21 06:12

    9.5 and newer:

    PostgreSQL 9.5 and newer support INSERT ... ON CONFLICT (key) DO UPDATE (and ON CONFLICT (key) DO NOTHING), i.e. upsert.

    Comparison with ON DUPLICATE KEY UPDATE.

    Quick explanation.

    For usage see the manual - specifically the conflict_action clause in the syntax diagram, and the explanatory text.

    Unlike the solutions for 9.4 and older that are given below, this feature works with multiple conflicting rows and it doesn't require exclusive locking or a retry loop.
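
    As a rough illustration of the multi-row case, here is a minimal Python sketch. It is a stand-in under stated assumptions, not PostgreSQL itself: it uses the standard-library sqlite3 module, because SQLite 3.24+ accepts an ON CONFLICT ... DO UPDATE clause modeled on PostgreSQL's, so the example runs without a server. The table and values are borrowed from the examples elsewhere on this page; with a PostgreSQL driver such as psycopg2 the placeholder would be %s rather than ?.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for a PostgreSQL 9.5+ server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testtable (id INTEGER PRIMARY KEY, somedata TEXT)")
conn.execute("INSERT INTO testtable (id, somedata) VALUES (2, 'old')")

# `excluded` refers to the row that was proposed for insertion
UPSERT_SQL = """
INSERT INTO testtable (id, somedata)
VALUES (?, ?)
ON CONFLICT (id) DO UPDATE SET somedata = excluded.somedata
"""

# id 2 already exists (gets updated); id 3 is new (gets inserted)
conn.executemany(UPSERT_SQL, [(2, "Joe"), (3, "Alan")])
conn.commit()

rows = conn.execute("SELECT id, somedata FROM testtable ORDER BY id").fetchall()
print(rows)  # [(2, 'Joe'), (3, 'Alan')]
```

    Using excluded avoids repeating the literal values in the SET clause, which matters once the same statement is reused for many rows.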

    The commit adding the feature is here and the discussion around its development is here.


    If you're on 9.5 and don't need to be backward-compatible you can stop reading now.


    9.4 and older:

    PostgreSQL 9.4 and older don't have any built-in UPSERT (or MERGE) facility, and doing it efficiently in the face of concurrent use is very difficult.

    This article discusses the problem in useful detail.

    In general you must choose between two options:

    • Individual insert/update operations in a retry loop; or
    • Locking the table and doing batch merge

    Individual row retry loop

    Using individual row upserts in a retry loop is the reasonable option if you want many connections concurrently trying to perform inserts.

    The PostgreSQL documentation contains a useful procedure that'll let you do this in a loop inside the database. It guards against lost updates and insert races, unlike most naive solutions. It will only work in READ COMMITTED mode and is only safe if it's the only thing you do in the transaction, though. The function won't work correctly if triggers or secondary unique keys cause unique violations.

    This strategy is very inefficient. Whenever practical you should queue up work and do a bulk upsert as described below instead.

    Many attempted solutions to this problem fail to consider rollbacks, so they result in incomplete updates. Two transactions race with each other; one of them successfully INSERTs; the other gets a duplicate key error and does an UPDATE instead. The UPDATE blocks waiting for the INSERT to roll back or commit. When it rolls back, the UPDATE condition re-check matches zero rows, so even though the UPDATE commits it hasn't actually done the upsert you expected. You have to check the result row counts and re-try where necessary.

    Some attempted solutions also fail to consider SELECT races. If you try the obvious and simple:

    -- THIS IS WRONG. DO NOT COPY IT. It's an EXAMPLE.
    
    BEGIN;
    
    UPDATE testtable
    SET somedata = 'blah'
    WHERE id = 2;
    
    -- Remember, this is WRONG. Do NOT COPY IT.
    
    INSERT INTO testtable (id, somedata)
    SELECT 2, 'blah'
    WHERE NOT EXISTS (SELECT 1 FROM testtable WHERE testtable.id = 2);
    
    COMMIT;
    

    then when two run at once there are several failure modes. One is the already discussed issue with an update re-check. Another is where both UPDATE at the same time, matching zero rows and continuing. Then they both do the EXISTS test, which happens before the INSERT. Both get zero rows, so both do the INSERT. One fails with a duplicate key error.

    This is why you need a re-try loop. You might think that you can prevent duplicate key errors or lost updates with clever SQL, but you can't. You need to check row counts or handle duplicate key errors (depending on the chosen approach) and re-try.
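
    The control flow of such a re-try loop can be sketched in client code as follows. This is only an illustration under assumptions, not the procedure from the PostgreSQL documentation: the function name upsert_with_retry and the max_attempts parameter are made up, and the standard-library sqlite3 module stands in for a PostgreSQL connection so the sketch runs as-is. With a real PostgreSQL driver you would catch that driver's unique-violation error instead of sqlite3.IntegrityError, and the caveats above (READ COMMITTED, nothing else in the transaction) still apply.

```python
import sqlite3


def upsert_with_retry(conn, row_id, value, max_attempts=5):
    """Try UPDATE first, fall back to INSERT, and retry if the INSERT loses a race."""
    for _ in range(max_attempts):
        cur = conn.execute(
            "UPDATE testtable SET somedata = ? WHERE id = ?", (value, row_id)
        )
        # Check the row count: zero means the key was not there to update
        if cur.rowcount > 0:
            conn.commit()
            return "updated"
        try:
            conn.execute(
                "INSERT INTO testtable (id, somedata) VALUES (?, ?)",
                (row_id, value),
            )
            conn.commit()
            return "inserted"
        except sqlite3.IntegrityError:
            # Someone else inserted the key in the meantime: undo and loop
            # back to the UPDATE
            conn.rollback()
    raise RuntimeError("upsert did not succeed after %d attempts" % max_attempts)


# Assumption: in-memory SQLite database used purely for demonstration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testtable (id INTEGER PRIMARY KEY, somedata TEXT)")

first = upsert_with_retry(conn, 2, "Joe")      # no row yet, so it is inserted
second = upsert_with_retry(conn, 2, "Joanna")  # row exists, so it is updated
print(first, second)  # inserted updated
```

    The point of the loop is that neither the row count check nor the duplicate key handler is sufficient alone; each covers the case the other misses.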

    Please don't roll your own solution for this. Like with message queuing, it's probably wrong.

    Bulk upsert with lock

    Sometimes you want to do a bulk upsert, where you have a new data set that you want to merge into an older existing data set. This is vastly more efficient than individual row upserts and should be preferred whenever practical.

    In this case, the process is typically:

    • CREATE a TEMPORARY table

    • COPY or bulk-insert the new data into the temp table

    • LOCK the target table IN EXCLUSIVE MODE. This permits other transactions to SELECT, but not make any changes to the table.

    • Do an UPDATE ... FROM of existing records using the values in the temp table;

    • Do an INSERT of rows that don't already exist in the target table;

    • COMMIT, releasing the lock.

    For instance, for the example given in the question, using a multi-valued INSERT to populate the temp table:

    BEGIN;
    
    CREATE TEMPORARY TABLE newvals(id integer, somedata text);
    
    INSERT INTO newvals(id, somedata) VALUES (2, 'Joe'), (3, 'Alan');
    
    LOCK TABLE testtable IN EXCLUSIVE MODE;
    
    UPDATE testtable
    SET somedata = newvals.somedata
    FROM newvals
    WHERE newvals.id = testtable.id;
    
    INSERT INTO testtable
    SELECT newvals.id, newvals.somedata
    FROM newvals
    LEFT OUTER JOIN testtable ON (testtable.id = newvals.id)
    WHERE testtable.id IS NULL;
    
    COMMIT;
    

    Related reading

    • UPSERT wiki page
    • UPSERTisms in Postgres
    • Insert, on duplicate update in PostgreSQL?
    • http://petereisentraut.blogspot.com/2010/05/merge-syntax.html
    • Upsert with a transaction
    • Is SELECT or INSERT in a function prone to race conditions?
    • SQL MERGE on the PostgreSQL wiki
    • Most idiomatic way to implement UPSERT in Postgresql nowadays

    What about MERGE?

    SQL-standard MERGE actually has poorly defined concurrency semantics and is not suitable for upserting without locking a table first.

    It's a really useful OLAP statement for data merging, but it's not a useful solution for concurrency-safe upsert. There's lots of advice to people using other DBMSes to use MERGE for upserts, but it's wrong.

    Other DBs:

    • INSERT ... ON DUPLICATE KEY UPDATE in MySQL
    • MERGE from MS SQL Server (but see above about MERGE problems)
    • MERGE from Oracle (but see above about MERGE problems)
  • 2020-11-21 06:14

    Here are some examples of insert ... on conflict ... (PostgreSQL 9.5+):

    • Insert, on conflict - do nothing.
      insert into dummy(id, name, size) values(1, 'new_name', 3)
      on conflict do nothing;
      
    • Insert, on conflict - do update, specify conflict target via column.
      insert into dummy(id, name, size) values(1, 'new_name', 3)
      on conflict(id)
      do update set name = 'new_name', size = 3;  
      
    • Insert, on conflict - do update, specify conflict target via constraint name.
      insert into dummy(id, name, size) values(1, 'new_name', 3)
      on conflict on constraint dummy_pkey
      do update set name = 'new_name', size = 4;
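
    To see what the first two variants leave behind, here is a small Python sketch that runs them against a pre-existing row. As an assumption for runnability it uses the standard-library sqlite3 module (SQLite 3.24+) in place of PostgreSQL; the starting row ('old_name', 1) is made up, and the "on constraint" form is left out because it is PostgreSQL-specific.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL 9.5+
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dummy (id INTEGER PRIMARY KEY, name TEXT, size INTEGER)")
conn.execute("INSERT INTO dummy (id, name, size) VALUES (1, 'old_name', 1)")

# Variant 1: on conflict do nothing, so the existing row is left untouched
conn.execute(
    "INSERT INTO dummy (id, name, size) VALUES (1, 'new_name', 3) "
    "ON CONFLICT DO NOTHING"
)
after_do_nothing = conn.execute(
    "SELECT name, size FROM dummy WHERE id = 1"
).fetchone()
print(after_do_nothing)  # ('old_name', 1)

# Variant 2: on conflict (id) do update, so the existing row is overwritten
conn.execute(
    "INSERT INTO dummy (id, name, size) VALUES (1, 'new_name', 3) "
    "ON CONFLICT (id) DO UPDATE SET name = 'new_name', size = 3"
)
after_do_update = conn.execute(
    "SELECT name, size FROM dummy WHERE id = 1"
).fetchone()
print(after_do_update)  # ('new_name', 3)
```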
      