I am trying to run a query that joins a table against itself and does fuzzy string comparison (using trigram comparisons) to find possible company name matches. My goal is to return records where one record's company name (ref_name field) is trigram-similar to another record's company name. Currently, I have my threshold set to 0.9 so it will only bring back matches that are very likely to contain a similar string.
I know that self joins can result in many comparisons by nature, but I want to optimize my query the best I can. I don't need results instantaneously, but currently the query I am running takes 11 hours to run.
I am running Postgres 9.2 on an Ubuntu 12.04 server. I don't know what the max length of the ref_name field (the field I'm matching on) is, so I set it to varchar(300). I wonder if setting it to a text type may affect performance at all or if there is a better field type to use to speed up performance. My LC_CTYPE and LC_COLLATE locales are set to "en_US.UTF-8".
The table I am running the query on consists of about 1.6 million records in total, but the query that takes me 11 hours to run is on a small subset of that (about 100k).
Table Structure:
CREATE TABLE ref_name (
ref_name_id integer,
ref_name character varying(300),
ref_name_type character varying(2),
name_display text,
load_date timestamp without time zone
)
Indexes:
CREATE INDEX ref_name_ref_name_trigram_idx ON ref_name
USING gist (ref_name COLLATE pg_catalog."default" gist_trgm_ops);
CREATE INDEX ref_name_ref_name_trigram_idx_1 ON ref_name
USING gist (ref_name COLLATE pg_catalog."default" gist_trgm_ops)
WHERE ref_name_type::text = 'E'::text;
CREATE INDEX ref_name_ref_name_e_idx ON ref_name
USING btree (ref_name COLLATE pg_catalog."default")
WHERE ref_name_type::text = 'E'::text;
Query:
select a.ref_name_id as name_id,a.ref_name AS name,
a.name_display AS name_display,b.ref_name_id AS matched_name_id,
b.ref_name AS matched_name,b.name_display AS matched_name_display
from ref_name a
JOIN ref_name b
ON a.ref_name_id<>b.ref_name_id
AND a.ref_name_id>b.ref_name_id
AND a.ref_name % b.ref_name
WHERE
a.ref_name ~>=~ 'A' and a.ref_name ~<~'B'
AND b.ref_name ~>=~ 'A' and b.ref_name ~<~'B'
AND a.ref_name_type='E'
AND b.ref_name_type='E'
Explain Plan:
"Nested Loop (cost=0.00..8560728.16 rows=3598470 width=96)"
" -> Seq Scan on ref_name a (cost=0.00..96556.12 rows=103901 width=48)"
" Filter: (((ref_name)::text ~>=~ 'A'::text) AND ((ref_name)::text ~<~ 'B'::text) AND ((ref_name_type)::text = 'E'::text))"
" -> Index Scan using ref_name_ref_name_trigram_idx_1 on ref_name b (cost=0.00..80.41 rows=35 width=48)"
" Index Cond: ((a.ref_name)::text % (ref_name)::text)"
" Filter: (((ref_name)::text ~>=~ 'A'::text) AND ((ref_name)::text ~<~ 'B'::text) AND (a.ref_name_id <> ref_name_id) AND (a.ref_name_id > ref_name_id))"
Here are some sample records:
1652632;"A 123 SYSTEMS";"E";"A 123 SYSTEMS INC";"2014-11-14 00:00:00"
1652633;"A123 SYSTEMS";"E";"A123 SYSTEMS INC";"2014-11-14 00:00:00"
1652640;"A 1 ACCOUSTICS";"E";"A-1 ACCOUSTICS";"2014-11-14 00:00:00"
1652641;"A 1 ACOUSTICS";"E";"A-1 ACOUSTICS";"2014-11-14 00:00:00"
1652642;"A1 ACOUSTICS";"E";"A1 ACOUSTICS INC";"2014-11-14 00:00:00"
1652650;"A 1 A ELECTRICAL";"E";"A-1 A ELECTRICAL INC";"2014-11-14 00:00:00"
1652651;"A 1 A ELECTRICIAN";"E";"A 1 A ELECTRICIAN INC";"2014-11-14 00:00:00"
1652652;"A 1A ELECTRICIAN";"E";"A 1A ELECTRICIAN INC";"2014-11-14 00:00:00"
1652653;"A1 A ELECTRICIAN";"E";"A1 A ELECTRICIAN INC";"2014-11-14 00:00:00"
1691270;"ALBERT GARLATTI";"E";"ALBERT GARLATTI";"2014-11-14 00:00:00"
1691271;"ALBERT GARLATTI CONSTRUCTION";"E";"ALBERT GARLATTI CONSTRUCTION CO";"2014-11-14 00:00:00"
1680892;"AG HOG PITTSBURGH";"E";"AG-HOG PITTSBURGH CO INC";"2014-11-14 00:00:00"
1680893;"AGHOG PITTSBURGH";"E";"AGHOG PITTSBURGH CO";"2014-11-14 00:00:00"
1680928;"AGILE PURSUITS FRACHISING";"E";"AGILE PURSUITS FRACHISING INC";"2014-11-14 00:00:00"
1680929;"AGILE PURSUITS FRANCHISING";"E";"AGILE PURSUITS FRANCHISING INC";"2014-11-14 00:00:00"
1680956;"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORT";"E";"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORT";"2014-11-14 00:00:00"
1680957;"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORTI";"E";"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORTI";"2014-11-14 00:00:00"
As you can see, I created a gist trigram index to speed things up (tried two different types so far for comparison). Does anyone have any suggestions on how I can improve the performance of this query and get it down from 11 hours to something more manageable? Eventually I would like to run this query on the whole table to compare records, not just this small subset.
Indices
The partial GiST index is good. I would at least test these two additional indices:
A GIN index:
CREATE INDEX ref_name_trgm_gin_idx ON ref_name
USING gin (ref_name gin_trgm_ops)
WHERE ref_name_type = 'E';
This may or may not be used. If you upgrade to Postgres 9.4, chances are much better because there have been major improvements to GIN indexes.
A varchar_pattern_ops index:
CREATE INDEX ref_name_pattern_ops_idx
ON ref_name (ref_name varchar_pattern_ops)
WHERE ref_name_type = 'E';
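Whether the planner actually picks either index can be checked with EXPLAIN. A minimal sketch, assuming the indexes above exist; the search string is taken from the sample records purely for illustration, and both queries repeat the `ref_name_type = 'E'` predicate so that the partial indexes qualify:

```sql
-- Trigram match: look for the GiST or GIN trigram index in the plan
EXPLAIN
SELECT ref_name_id, ref_name
FROM   ref_name
WHERE  ref_name_type = 'E'
AND    ref_name % 'A 1 ACOUSTICS';

-- Left-anchored pattern: look for the varchar_pattern_ops index in the plan
EXPLAIN
SELECT ref_name_id, ref_name
FROM   ref_name
WHERE  ref_name_type = 'E'
AND    ref_name LIKE 'A%';
```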
Query
The problem at the heart of this query is that you are running into a cross join with O(N²) complexity when checking all rows against all rows. Performance becomes unbearable with a very large number of rows. You seem to be well aware of the dynamic. The defense is to limit possible combinations. You already took a step in that direction by limiting to the same first letter.
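To put rough numbers on the quadratic growth: with the condition `a.ref_name_id > b.ref_name_id`, each unordered pair is checked once, so N rows give N(N-1)/2 candidate pairs. A quick calculation, assuming the approximate row counts from the question (100k in the subset, 1.6 million in the whole table):

```sql
-- Candidate pairs for a self-join that checks each unordered pair once
SELECT 100000::bigint  * (100000  - 1) / 2 AS pairs_subset      -- 4999950000
     , 1600000::bigint * (1600000 - 1) / 2 AS pairs_full_table; -- 1279999200000
```

That is about 5 billion pairs for the subset and about 1.28 trillion for the whole table, which is why cutting down the candidate set matters more than any single index choice.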
A very good option here is to build on a special talent of GiST indices for nearest neighbour search. There is a hint in the manual for this query technique:
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.
A GIN index may still get used in addition to the GiST index. You have to weigh cost and benefit. It may be cheaper overall to stick with one big index in versions before 9.4, but the extra index is probably worth it in pg 9.4.
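The query shape that the manual's hint refers to orders by the trigram distance operator `<->` and caps the result with LIMIT. A minimal sketch against the ref_name table, with the search string taken from the sample records purely for illustration:

```sql
-- Nearest-neighbour search: the 10 names closest to the given string
SELECT ref_name_id
     , ref_name
     , ref_name <-> 'A 1 ACOUSTICS' AS dist  -- distance = 1 - similarity
FROM   ref_name
WHERE  ref_name_type = 'E'                   -- lets the partial GiST index qualify
ORDER  BY dist
LIMIT  10;
```

The queries below apply the same pattern once per row of the outer table instead of once for a constant string.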
Postgres 9.2
Use correlated subqueries as a substitute for the LATERAL join, which does not exist yet in this version:
SELECT a.*
, b.ref_name AS match_name
, b.name_display AS match_name_display
FROM (
SELECT ref_name_id
, ref_name
, name_display
, (SELECT ref_name_id AS match_name_id
FROM ref_name b
WHERE ref_name_type = 'E'
AND ref_name ~~ 'A%'
AND ref_name_id > a.ref_name_id
AND ref_name % a.ref_name
ORDER BY ref_name <-> a.ref_name
LIMIT 1 -- max. 1 best match
)
FROM ref_name a
WHERE ref_name ~~ 'A%'
AND ref_name_type = 'E'
) a
JOIN ref_name b ON b.ref_name_id = a.match_name_id
ORDER BY 1;
Obviously, this also needs an index on ref_name_id, which should normally be the PK and therefore indexed automatically.
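The table definition in the question declares no primary key, so that index may be missing. A sketch of the two ways to add it, assuming ref_name_id is unique and contains no NULL values; the index name in the second statement is made up for illustration:

```sql
-- Option 1: declare the primary key, which creates a unique index.
-- Fails if ref_name_id has duplicates or NULLs.
ALTER TABLE ref_name ADD PRIMARY KEY (ref_name_id);

-- Option 2: a plain btree index, if a PK constraint is not wanted.
-- (hypothetical index name)
CREATE INDEX ref_name_ref_name_id_idx ON ref_name (ref_name_id);
```

Only one of the two is needed.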
I added two more variants in the SQL Fiddle.
Postgres 9.3+
Use a LATERAL join for matching set to set. Similar to chapter 2a in this related answer:
SELECT a.ref_name_id
, a.ref_name
, a.name_display
, b.ref_name_id AS match_name_id
, b.ref_name AS match_name
, b.name_display AS match_name_display
FROM ref_name a
, LATERAL (
SELECT b.ref_name_id, b.ref_name, b.name_display
FROM ref_name b
WHERE b.ref_name ~~ 'A%'
AND b.ref_name_type = 'E'
AND a.ref_name_id < b.ref_name_id
AND a.ref_name % b.ref_name -- also enforce min. similarity
ORDER BY a.ref_name <-> b.ref_name
LIMIT 10 -- max. 10 best matches
) b
WHERE a.ref_name ~~ 'A%' -- you can extend the search
AND a.ref_name_type = 'E'
ORDER BY 1;
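Note that the `%` operator in these queries compares against the current pg_trgm threshold, which defaults to 0.3. To get the stricter matching described in the question, the threshold has to be raised in the same session before running the query. A minimal sketch using the functions pg_trgm provides in these versions; the name pair comes from the sample records:

```sql
-- Current threshold for this session (pg_trgm's default is 0.3)
SELECT show_limit();

-- Raise it to the 0.9 used in the question; applies to the current session only
SELECT set_limit(0.9);

-- Inspect the raw score for one pair from the sample records
SELECT similarity('AGILE PURSUITS FRACHISING', 'AGILE PURSUITS FRANCHISING');
```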
SQL Fiddle with all variants compared to your original query on 40k rows modeled after your case.
The queries are 2-5x faster than your original in the fiddle, and I expect them to scale much better with millions of rows. You'll have to test.
Extending the search for matches in b to all rows (while limiting candidates in a to a reasonable number) is rather cheap, too. I added two other variants to the fiddle.
Aside: I ran all tests with text instead of varchar, but that shouldn't make a difference.
Source: https://stackoverflow.com/questions/29265770/improving-performance-with-a-similarity-postgres-fuzzy-self-join-query