Improving performance with a Similarity Postgres fuzzy self join query

Submitted by 丶灬走出姿态 on 2020-01-02 07:21:28

Question


I am trying to run a query that joins a table against itself and does fuzzy string comparison (using trigram comparisons) to find possible company name matches. My goal is to return records where one record's company name (ref_name field) is trigram-similar to another record's company name. Currently, I have my threshold set to 0.9, so it will only bring back matches that are very likely to contain a similar string.
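For reference, the % operator compares the trigram similarity of its two operands against a session-level threshold. A minimal sketch of inspecting and setting that threshold with the pg_trgm functions, assuming the extension is installed in the database; the two literals are taken from the sample records listed later in the question:

```sql
-- Assumes CREATE EXTENSION pg_trgm; has already been run in this database.
SELECT show_limit();      -- threshold currently used by the % operator
SELECT set_limit(0.9);    -- set it to 0.9 for the current session only

-- Raw similarity score, and whether the pair passes the current threshold.
SELECT similarity('A 1 ACCOUSTICS', 'A 1 ACOUSTICS')    AS score
     , 'A 1 ACCOUSTICS'::text % 'A 1 ACOUSTICS'::text   AS passes_threshold;

-- The trigram set that the score is computed from.
SELECT show_trgm('A 1 ACOUSTICS');
```

Because set_limit() only affects the current session, it has to be called in the same session that runs the join.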

I know that self joins can result in many comparisons by nature, but I want to optimize my query the best I can. I don't need results instantaneously, but currently the query I am running takes 11 hours to run.

I am running Postgres 9.2 on an Ubuntu 12.04 server. I don't know what the max length of the ref_name field (the field I'm matching on) is, so I set it to varchar(300). I wonder if setting it to a text type may affect performance at all, or if there is a better field type to use to speed up performance. My LC_CTYPE and LC_COLLATE locales are set to "en_US.UTF-8".

The table I am running the query on consists of about 1.6 million records in total, but the query that takes me 11 hours to run is on a small subset of that (about 100k).

Table Structure:

CREATE TABLE ref_name (
  ref_name_id integer,
  ref_name character varying(300),
  ref_name_type character varying(2),
  name_display text,
  load_date timestamp without time zone
)

Indexes:

CREATE INDEX ref_name_ref_name_trigram_idx ON ref_name
  USING gist (ref_name COLLATE pg_catalog."default" gist_trgm_ops);

CREATE INDEX ref_name_ref_name_trigram_idx_1 ON ref_name
  USING gist (ref_name COLLATE pg_catalog."default" gist_trgm_ops)
  WHERE ref_name_type::text = 'E'::text;

CREATE INDEX ref_name_ref_name_e_idx ON ref_name
  USING btree (ref_name COLLATE pg_catalog."default")
  WHERE ref_name_type::text = 'E'::text;

Query:

select a.ref_name_id as name_id,a.ref_name AS name,
  a.name_display AS name_display,b.ref_name_id AS matched_name_id,
  b.ref_name AS matched_name,b.name_display AS matched_name_display
from ref_name a
JOIN ref_name b
 ON a.ref_name_id<>b.ref_name_id
 AND a.ref_name_id>b.ref_name_id
 AND a.ref_name % b.ref_name
WHERE 
 a.ref_name ~>=~ 'A' and a.ref_name ~<~'B'
 AND b.ref_name ~>=~ 'A' and b.ref_name ~<~'B'
 AND a.ref_name_type='E'
 AND b.ref_name_type='E'

Explain Plan:

"Nested Loop  (cost=0.00..8560728.16 rows=3598470 width=96)"
"  ->  Seq Scan on ref_name a  (cost=0.00..96556.12 rows=103901 width=48)"
"        Filter: (((ref_name)::text ~>=~ 'A'::text) AND ((ref_name)::text ~<~ 'B'::text) AND ((ref_name_type)::text = 'E'::text))"
"  ->  Index Scan using ref_name_ref_name_trigram_idx_1 on ref_name b  (cost=0.00..80.41 rows=35 width=48)"
"        Index Cond: ((a.ref_name)::text % (ref_name)::text)"
"        Filter: (((ref_name)::text ~>=~ 'A'::text) AND ((ref_name)::text ~<~ 'B'::text) AND (a.ref_name_id <> ref_name_id) AND (a.ref_name_id > ref_name_id))"

Here are some sample records:

1652632;"A 123 SYSTEMS";"E";"A 123 SYSTEMS INC";"2014-11-14 00:00:00"
1652633;"A123 SYSTEMS";"E";"A123 SYSTEMS INC";"2014-11-14 00:00:00"
1652640;"A 1 ACCOUSTICS";"E";"A-1 ACCOUSTICS";"2014-11-14 00:00:00"
1652641;"A 1 ACOUSTICS";"E";"A-1 ACOUSTICS";"2014-11-14 00:00:00"
1652642;"A1 ACOUSTICS";"E";"A1 ACOUSTICS INC";"2014-11-14 00:00:00"
1652650;"A 1 A ELECTRICAL";"E";"A-1 A ELECTRICAL INC";"2014-11-14 00:00:00"
1652651;"A 1 A ELECTRICIAN";"E";"A 1 A ELECTRICIAN INC";"2014-11-14 00:00:00"
1652652;"A 1A ELECTRICIAN";"E";"A 1A ELECTRICIAN INC";"2014-11-14 00:00:00"
1652653;"A1 A ELECTRICIAN";"E";"A1 A ELECTRICIAN INC";"2014-11-14 00:00:00"
1691270;"ALBERT GARLATTI";"E";"ALBERT GARLATTI";"2014-11-14 00:00:00"
1691271;"ALBERT GARLATTI CONSTRUCTION";"E";"ALBERT GARLATTI CONSTRUCTION CO";"2014-11-14 00:00:00"
1680892;"AG HOG PITTSBURGH";"E";"AG-HOG PITTSBURGH CO INC";"2014-11-14 00:00:00"
1680893;"AGHOG PITTSBURGH";"E";"AGHOG PITTSBURGH CO";"2014-11-14 00:00:00"
1680928;"AGILE PURSUITS FRACHISING";"E";"AGILE PURSUITS FRACHISING INC";"2014-11-14 00:00:00"
1680929;"AGILE PURSUITS FRANCHISING";"E";"AGILE PURSUITS FRANCHISING INC";"2014-11-14 00:00:00"
1680956;"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORT";"E";"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORT";"2014-11-14 00:00:00"
1680957;"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORTI";"E";"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORTI";"2014-11-14 00:00:00"

As you can see, I created a gist trigram index to speed things up (tried two different types so far for comparison). Does anyone have any suggestions on how I can improve the performance of this query and get it down from 11 hours to something more manageable? Eventually I would like to run this query on the whole table to compare records, not just this small subset.


Answer 1:


Indices

The partial GiST index is good. I would at least test these two additional indices:

A GIN index:

CREATE INDEX ref_name_trgm_gin_idx ON ref_name
USING gin (ref_name gin_trgm_ops)
WHERE ref_name_type = 'E';

This may or may not be used. If you upgrade to Postgres 9.4, chances are much better because there have been major improvements to GIN indexes.

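A varchar_pattern_ops index, to support the left-anchored pattern filter (ref_name ~~ 'A%') under a non-C locale such as en_US.UTF-8: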

CREATE INDEX ref_name_pattern_ops_idx
ON ref_name (ref_name varchar_pattern_ops)
WHERE ref_name_type = 'E';
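
Whether the planner actually picks one of these indexes for the left-anchored filter can be checked with EXPLAIN. A small sketch, assuming the indexes above have been created:

```sql
ANALYZE ref_name;   -- refresh planner statistics after creating the indexes

EXPLAIN
SELECT ref_name_id, ref_name
FROM   ref_name
WHERE  ref_name_type = 'E'
AND    ref_name ~~ 'A%';   -- ~~ is the operator behind LIKE
```

For a prefix as unselective as a single letter, the planner may still prefer a sequential scan, so no particular plan is guaranteed.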

Query

The problem at the heart of this query is that you are running into a cross join with O(N²) comparisons when checking all rows against all rows. Performance becomes unbearable with a very large number of rows. You seem to be well aware of the dynamic. The defense is to limit possible combinations. You already took a step in that direction by limiting matches to the same first letter.
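
To see how much the first-letter restriction limits the combinations, the candidate pairs per bucket can be counted directly. A rough sketch, using n * (n - 1) / 2 unordered pairs for a group of n rows; the column aliases are made up for illustration:

```sql
-- Candidate pairs per first letter among the 'E' rows.
-- count(*) returns bigint, so the product does not overflow at this scale.
SELECT left(ref_name, 1)               AS first_letter
     , count(*)                        AS n
     , count(*) * (count(*) - 1) / 2   AS candidate_pairs
FROM   ref_name
WHERE  ref_name_type = 'E'
GROUP  BY 1
ORDER  BY candidate_pairs DESC;
```

With about 100k rows in one bucket, this comes to roughly 5 billion pairs, which is why an unrestricted pairwise comparison gets so expensive.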

A very good option here is to build on a special talent of GiST indices: nearest-neighbour search. There is a hint in the manual for this query technique:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.

A GIN index may still get used in addition to the GiST index. You have to weigh cost and benefit. It may be cheaper overall to stick with one big index in versions before 9.4, but the additional GIN index is probably worth it in Postgres 9.4.
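
As an illustration of the nearest-neighbour formulation on its own, here is a minimal sketch that looks up the closest names for a single search string. The <-> operator is the pg_trgm distance operator (1 minus similarity); the search string and the alias dist are chosen only for this example:

```sql
-- Five closest 'E' names to one given string, ordered by trigram distance.
-- The ORDER BY ... <-> ... LIMIT form is what a GiST trigram index can serve.
SELECT ref_name_id
     , ref_name
     , ref_name <-> 'A 1 ACOUSTICS' AS dist
FROM   ref_name
WHERE  ref_name_type = 'E'          -- matches the predicate of the partial index
ORDER  BY ref_name <-> 'A 1 ACOUSTICS'
LIMIT  5;
```

The queries below apply the same pattern once per row of the outer table instead of once for a constant.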

Postgres 9.2

Use correlated subqueries to substitute for the LATERAL join, which does not exist yet in this version:

SELECT a.*
     , b.ref_name     AS match_name
     , b.name_display AS match_name_display
FROM  (
   SELECT ref_name_id
        , ref_name
        , name_display
        , (SELECT ref_name_id AS match_name_id
           FROM   ref_name b
           WHERE  ref_name_type = 'E'
           AND    ref_name ~~ 'A%'
           AND    ref_name_id > a.ref_name_id
           AND    ref_name % a.ref_name
           ORDER  BY ref_name <-> a.ref_name
           LIMIT  1                                -- max. 1 best match
          )
   FROM   ref_name a
   WHERE  ref_name ~~ 'A%'
   AND    ref_name_type = 'E'
   ) a
JOIN   ref_name b ON b.ref_name_id = a.match_name_id
ORDER  BY 1;

Obviously, this also needs an index on ref_name_id, which should normally be the PK and therefore indexed automatically.
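
The table definition in the question does not show a primary key, so the index may have to be added. A possible sketch, where the constraint and index names are only suggestions and the first option assumes ref_name_id is unique and never NULL:

```sql
-- Option 1: assumes ref_name_id is unique and contains no NULLs.
ALTER TABLE ref_name
  ADD CONSTRAINT ref_name_pkey PRIMARY KEY (ref_name_id);

-- Option 2: a plain index, if uniqueness cannot be guaranteed.
-- CREATE INDEX ref_name_ref_name_id_idx ON ref_name (ref_name_id);
```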

I added two more variants in the SQL Fiddle.

Postgres 9.3+

Use a LATERAL join for matching set to set. Similar to chapter 2a in this related answer:

  • Optimize GROUP BY query to retrieve latest record per user

SELECT a.ref_name_id
     , a.ref_name
     , a.name_display
     , b.ref_name_id  AS match_name_id
     , b.ref_name     AS match_name
     , b.name_display AS match_name_display
FROM   ref_name a
,   LATERAL (
   SELECT b.ref_name_id, b.ref_name, b.name_display
   FROM   ref_name b
   WHERE  b.ref_name ~~ 'A%'
   AND    b.ref_name_type = 'E'
   AND    a.ref_name_id < b.ref_name_id
   AND    a.ref_name % b.ref_name  -- also enforce min. similarity
   ORDER  BY a.ref_name <-> b.ref_name
   LIMIT  10                                -- max. 10 best matches
   ) b
WHERE  a.ref_name ~~ 'A%'   -- you can extend the search
AND    a.ref_name_type = 'E'
ORDER  BY 1;

SQL Fiddle with all variants compared to your original query on 40k rows modeled after your case.

The queries are 2 to 5 times faster than your original in the fiddle, and I expect them to scale much better with millions of rows. You'll have to test.

Extending the search for matches in b to all rows (while limiting candidates in a to a reasonable number) is rather cheap, too. I added two other variants to the fiddle.
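
One way such an extended search could look is sketched below. This is an illustration under assumptions and not necessarily identical to the fiddle variants: the prefix filter on b is dropped so that all 'E' rows are candidates, and the id comparison becomes <> so that matches with a smaller id outside the 'A' bucket are not lost. It requires Postgres 9.3 or later.

```sql
SELECT a.ref_name_id
     , a.ref_name
     , b.ref_name_id AS match_name_id
     , b.ref_name    AS match_name
FROM   ref_name a
,      LATERAL (
   SELECT b.ref_name_id, b.ref_name
   FROM   ref_name b
   WHERE  b.ref_name_type = 'E'             -- no prefix filter: search all 'E' rows
   AND    b.ref_name_id <> a.ref_name_id    -- exclude the row itself
   AND    a.ref_name % b.ref_name           -- enforce min. similarity
   ORDER  BY a.ref_name <-> b.ref_name
   LIMIT  10                                -- max. 10 best matches
   ) b
WHERE  a.ref_name ~~ 'A%'                   -- candidates in a stay limited
AND    a.ref_name_type = 'E'
ORDER  BY 1;
```

Note that with <>, a pair whose members both start with 'A' is reported once in each direction.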

Aside: I ran all tests with text instead of varchar, but that shouldn't make a difference.

Basics and links:

  • Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL


Source: https://stackoverflow.com/questions/29265770/improving-performance-with-a-similarity-postgres-fuzzy-self-join-query
