psycopg2: insert multiple rows with one query

谎友^ 2020-11-22 09:11

I need to insert multiple rows with one query (the number of rows is not constant), so I need to execute a query like this one:

INSERT INTO t (a, b) VALUES (1, 2), (3, 4), (5, 6);
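
(For reference: newer psycopg2 releases ship psycopg2.extras.execute_values, which generates exactly this kind of multi-row VALUES statement. A minimal sketch, assuming an open cursor cur and the table from the example above:)

from psycopg2.extras import execute_values

rows = [(1, 2), (3, 4), (5, 6)]  # the number of rows can vary

# execute_values expands the single %s into one (...) group per row
execute_values(cur, "INSERT INTO t (a, b) VALUES %s", rows)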


        
15 Answers
  • 2020-11-22 09:39

    Another nice and efficient approach is to pass the rows for insertion as a single argument, which is an array of JSON objects.

    E.g. you pass this argument:

    [ {"id": 18, "score": 1}, {"id": 19, "score": 5} ]


    It is an array that may contain any number of objects. Then your SQL looks like:

    INSERT INTO links (parent_id, child_id, score) 
    SELECT 123, (r->>'id')::int, (r->>'score')::int 
    FROM unnest($1::json[]) as r 
    

    Note: your PostgreSQL must be new enough to support json.
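
    From psycopg2, one way to supply that single argument is to wrap each dict with psycopg2.extras.Json and pass a list of them (psycopg2 uses %s placeholders rather than $1). A rough sketch, reusing the table and column names from the snippet above and a placeholder connection string:

    import psycopg2
    from psycopg2.extras import Json

    rows = [{"id": 18, "score": 1}, {"id": 19, "score": 5}]

    conn = psycopg2.connect("dbname=test")  # placeholder connection string
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO links (parent_id, child_id, score)
            SELECT 123, (r->>'id')::int, (r->>'score')::int
            FROM unnest(%s::json[]) AS r
            """,
            ([Json(r) for r in rows],),  # one parameter: an array of json values
        )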

  • 2020-11-22 09:40

    I built a program that inserts multiple rows into a server located in another city.

    I found out that using this method was about 10 times faster than executemany. In my case tup is a tuple containing about 2000 rows. It took about 10 seconds when using this method:

    args_str = ','.join(cur.mogrify("(%s,%s,%s,%s,%s,%s,%s,%s,%s)", x) for x in tup)
    cur.execute("INSERT INTO table VALUES " + args_str) 
    

    and 2 minutes when using this method:

    cur.executemany("INSERT INTO table VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s)", tup)
    
  • 2020-11-22 09:42

    I've been using ant32's answer above for several years. However, I've found that it throws an error in Python 3 because mogrify returns a byte string.

    Converting explicitly to byte strings is a simple solution for making the code Python 3 compatible.

    args_str = b','.join(cur.mogrify("(%s,%s,%s,%s,%s,%s,%s,%s,%s)", x) for x in tup) 
    cur.execute(b"INSERT INTO table VALUES " + args_str)
    
  • 2020-11-22 09:42

    The cursor.copy_from solution as provided by @joseph.sheedy (https://stackoverflow.com/users/958118/joseph-sheedy) above (https://stackoverflow.com/a/30721460/11100064) is indeed lightning fast.

    However, the example he gives is not generically usable for a record with any number of fields, and it took me a while to figure out how to use it correctly.

    The IteratorFile needs to be instantiated with tab-separated fields like this (records is a list of dicts, where each dict is a record):

        f = IteratorFile("{0}\t{1}\t{2}\t{3}\t{4}".format(r["id"],
            r["type"],
            r["item"],
            r["month"],
            r["revenue"]) for r in records)
    

    To generalise for an arbitrary number of fields, we first create a line template with the correct number of tab-separated field placeholders ("{}\t{}\t{}...\t{}"), and then use .format() to fill in the field values with *list(r.values()) for each r in records:

            line = "\t".join(["{}"] * len(records[0]))
    
            f = IteratorFile(line.format(*list(r.values())) for r in records)
    

    complete function in gist here.
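
    Put together as one small helper (a rough sketch: it assumes the IteratorFile class from the gist, that every record dict has the same keys in the same order, and that no value contains a tab or newline):

    def copy_records(cursor, table, records):
        # Column names come from the first record; Python 3.7+ dicts keep insertion order.
        columns = list(records[0].keys())
        line = "\t".join(["{}"] * len(columns))
        f = IteratorFile(line.format(*(r[c] for c in columns)) for r in records)
        cursor.copy_from(f, table, columns=columns)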

  • 2020-11-22 09:48

    cursor.copy_from is the fastest solution I've found for bulk inserts by far. Here's a gist I made containing a class named IteratorFile which allows an iterator yielding strings to be read like a file. We can convert each input record to a string using a generator expression. So the solution would be

    args = [(1,2), (3,4), (5,6)]
    f = IteratorFile(("{}\t{}".format(x[0], x[1]) for x in args))
    cursor.copy_from(f, 'table_name', columns=('a', 'b'))
    

    For this trivial size of args it won't make much of a speed difference, but I see big speedups when dealing with thousands of rows or more. It is also more memory efficient than building a giant query string: an iterator only ever holds one input record in memory at a time, whereas building the query string can eventually exhaust memory in your Python process or in Postgres.
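
    The gist itself isn't reproduced in this thread; a minimal, approximate sketch of such an IteratorFile wrapper (just enough of the file protocol, read() and readline(), for copy_from to consume) might look like:

    class IteratorFile:
        """Rough sketch: wraps an iterator of record strings so that
        cursor.copy_from() can consume it via read() / readline()."""

        def __init__(self, it):
            self._it = iter(it)
            self._buf = ""

        def read(self, size=-1):
            # Pull records until we have `size` characters; size < 0 means "read everything".
            while size < 0 or len(self._buf) < size:
                try:
                    self._buf += next(self._it) + "\n"
                except StopIteration:
                    break
            if size < 0:
                data, self._buf = self._buf, ""
            else:
                data, self._buf = self._buf[:size], self._buf[size:]
            return data

        def readline(self):
            if "\n" not in self._buf:
                try:
                    self._buf += next(self._it) + "\n"
                except StopIteration:
                    pass
            line, sep, self._buf = self._buf.partition("\n")
            return line + sep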

  • 2020-11-22 09:48

    All of these techniques are called "extended inserts" in Postgres terminology, and as of the 24th of November 2016, this is still a ton faster than psycopg2's executemany() and all the other methods listed in this thread (which I tried before coming to this answer).

    Here's some code which doesn't use cur.mogrify and is nice and simple to get your head around:

    valueSQL = [ '%s', '%s', '%s', ... ]  # as many as you have columns.
    sqlrows = []
    rowsPerInsert = 3  # more means faster, but with diminishing returns..
    for row in getSomeData:
        # row == [1, 'a', 'yolo', ... ]
        sqlrows += row
        if (len(sqlrows) // len(valueSQL)) % rowsPerInsert == 0:
            # sqlrows == [ 1, 'a', 'yolo', 2, 'b', 'swag', 3, 'c', 'selfie' ]
            insertSQL = 'INSERT INTO "twitter" VALUES ' + ','.join(['(' + ','.join(valueSQL) + ')'] * rowsPerInsert)
            cur.execute(insertSQL, sqlrows)
            con.commit()
            sqlrows = []
    # flush whatever is left over (fewer than rowsPerInsert rows)
    if sqlrows:
        insertSQL = 'INSERT INTO "twitter" VALUES ' + ','.join(['(' + ','.join(valueSQL) + ')'] * (len(sqlrows) // len(valueSQL)))
        cur.execute(insertSQL, sqlrows)
        con.commit()
    

    But it should be noted that if you can use copy_from(), you should use copy_from ;)
