I'm looking for the most efficient way to bulk-insert several million tuples into a database. I'm using Python, PostgreSQL and psycopg2.
I have created a long list of tuples that should be inserted into the database.
A very related question: Bulk insert with SQLAlchemy ORM
All roads lead to Rome, but some of them cross mountains or require ferries. If you want to get there quickly, just take the motorway.
In this case the motorway is to use the execute_batch() feature of psycopg2. The documentation says it best:
The current implementation of executemany() is (using an extremely charitable understatement) not particularly performing. These functions can be used to speed up the repeated execution of a statement against a set of parameters. By reducing the number of server roundtrips the performance can be orders of magnitude better than using executemany().
In my own test execute_batch() is approximately twice as fast as executemany(), and gives the option to configure the page_size for further tweaking (if you want to squeeze the last 2-3% of performance out of the driver).
The same feature can easily be enabled if you are using SQLAlchemy by setting use_batch_mode=True as a parameter when you instantiate the engine with create_engine().
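A minimal sketch of the psycopg2 route, assuming a table mytable(id, var1) and a local database (the connection string, table and columns are illustrative, not from the question):

import psycopg2
import psycopg2.extras

conn = psycopg2.connect('dbname=mydatabase')  # assumed connection string
cur = conn.cursor()

data = [(1, 100), (2, 200)]  # the long list of tuples to insert

# execute_batch groups many parameter sets into few server roundtrips;
# page_size controls how many statements are sent per roundtrip
psycopg2.extras.execute_batch(
    cur,
    "INSERT INTO mytable (id, var1) VALUES (%s, %s)",
    data,
    page_size=1000,
)
conn.commit()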
You could use a new upsert library:
$ pip install upsert
(you may have to pip install decorator first)
import psycopg2
from upsert import Upsert  # import path is an assumption; check the library's docs

conn = psycopg2.connect('dbname=mydatabase')
cur = conn.cursor()
upsert = Upsert(cur, 'mytable')
for (selector, setter) in myrecords:
    upsert.row(selector, setter)
Where selector is a dict like {'name': 'Chris Smith'} and setter is a dict like {'age': 28, 'state': 'WI'}.
It's almost as fast as writing custom INSERT [/UPDATE] code and running it directly with psycopg2... and it won't blow up if the row already exists.
The first and the second would be used together, not separately. The third would be the most efficient server-wise though, since the server would do all the hard work.
In my experience executemany is not any faster than running many inserts yourself; the fastest way is to format a single INSERT with many values yourself. Maybe in the future executemany will improve, but for now it is quite slow.
I subclass list and overload the append method, so when the list reaches a certain size I format the INSERT and run it.
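A minimal sketch of that idea, assuming a table mytable(id, var1) and an already-open cursor (the class name and threshold are hypothetical):

class InsertBuffer(list):
    """A list that flushes itself as one multi-row INSERT when it reaches a threshold."""

    def __init__(self, cur, threshold=1000):
        super().__init__()
        self.cur = cur
        self.threshold = threshold

    def append(self, row):
        super().append(row)
        if len(self) >= self.threshold:
            self.flush()

    def flush(self):
        if not self:
            return
        # Build a single INSERT with many VALUES groups; mogrify handles the quoting
        values = ",".join(
            self.cur.mogrify("(%s, %s)", row).decode("utf-8") for row in self
        )
        self.cur.execute("INSERT INTO mytable (id, var1) VALUES " + values)
        del self[:]

Remember to call flush() once more after the loop to write any leftover rows, and then commit the connection.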
After some testing, unnest often seems to be an extremely fast option, as I learned from @Clodoaldo Neto's answer to a similar question.
data = [(1, 100), (2, 200), ...]  # list of tuples

cur.execute("""CREATE TABLE table1 AS
               SELECT u.id, u.var1
               FROM unnest(%s) u(id INT, var1 INT)""", (data,))
However, it can be tricky with extremely large data.
There is a new psycopg2 manual containing examples for all the options.
The COPY option is the most efficient, followed by executemany(), and then execute() with pyformat.
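A minimal sketch of the COPY route with copy_from() and an in-memory buffer (the connection string, table and columns are assumptions for illustration):

import io
import psycopg2

conn = psycopg2.connect('dbname=mydatabase')  # assumed connection string
cur = conn.cursor()

data = [(1, 100), (2, 200)]  # list of tuples

# Serialize the tuples into a tab-separated buffer and stream it to the server with COPY
buf = io.StringIO()
for row in data:
    buf.write('\t'.join(map(str, row)) + '\n')
buf.seek(0)

cur.copy_from(buf, 'mytable', sep='\t', columns=('id', 'var1'))
conn.commit()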