问题
I have list of products (3 million items) without IDs - only titles. But I don't know which titles already exist in DB. New products (about 2.9 million items) must be added into DB. After that I must know ID for each products (new and existing).
Is there the fastest way to do it in PostgreSQL? I can change DB as needed (add default values, add columns etc.).
回答1:
Import data
COPY everything to a temporary staging table and insert only new titles into your target table.
CREATE TEMP TABLE tmp(title text);
COPY tmp FROM 'path/to/file.csv';
ANALYZE tmp;
INSERT INTO tbl
SELECT DISTINCT tmp.title
FROM tmp
LEFT JOIN tbl USING (title)
WHERE tbl.title IS NULL;
IDs should be generated automatically with a serial column tbl_id
in tbl
.
The LEFT JOIN
/ IS NULL
construct disqualifies already existing titles. NOT EXISTS
would be another possibility.
DISTINCT
prevents duplicates in the incoming data in the temporary table tmp
.
ANALYZE is useful to make sure the query planner picks a sensible plan, and temporary tables are not analyzed by autovacuum.
Since you have 3 million items, it might pay to raise the setting for temp_buffer
(for this session only):
SET temp_buffers = 1000MB;
Or however much you can afford and is enough to hold the temp table in RAM, which is much faster. Note: must be done first in the session - before any temp objects are created.
Retrieve IDs
To see all IDs for the imported data:
SELECT tbl.tbl_id, tbl.title
FROM tbl
JOIN tmp USING (title)
In the same session! A temporary table is dropped automatically at the end of the session.
来源:https://stackoverflow.com/questions/15834569/how-to-bulk-insert-only-new-rows-in-postresql