Regex pattern matching with pg_trgm (trigram matching)

Posted by 点点圈 on 2021-01-28 07:44:09

Question


I have a PostgreSQL table called mydata with a column called text. I want to do regex pattern matching and return only the matching snippet, not the entire text. I know pg_trgm can create a trigram index to speed up the search, but is there a way to do both the searching and the matching in a single combined statement?

I'll provide some context:

CREATE EXTENSION pg_trgm;
CREATE INDEX text_trgm_idx ON mydata USING GIN(text gin_trgm_ops);

I'll use the example regex pattern of '(1998.{0,10})', but I'm actually interested in any type of pattern, not just this example string.

This returns the desired pattern match, but it doesn't appear to use the pg_trgm index (note that title is another column, not the one I'm matching on):

EXPLAIN ANALYZE SELECT title, regexp_matches(text, '(1998.{0,10})') FROM mydata;
 Seq Scan on mydata  (cost=0.00..2257.89 rows=201720 width=73)
 Planning time: 0.047 ms
 Execution time: 2493.105 ms

Now, adding a WHERE clause:

EXPLAIN ANALYZE SELECT title, regexp_matches(text, '(1998.{0,10})') FROM mydata WHERE text ~ '(1998.{0,10})';
 Bitmap Heap Scan on mydata  (cost=28.01..35.88 rows=20 width=73)
   Rows Removed by Index Recheck: 20
   Heap Blocks: exact=723
   ->  Bitmap Index Scan on text_trgm_idx  (cost=0.00..28.01 rows=2 width=0) (actual time=0.930..0.930 rows=2059 loops=1)
         Index Cond: (text ~ '(1998.{0,10})'::text)
 Planning time: 15.889 ms
 Execution time: 1583.970 ms

However, dropping regexp_matches from the SELECT list gives even better performance, so I suspect the same regex work is being done twice:

EXPLAIN ANALYZE SELECT title FROM mydata WHERE text ~ '(1998.{0,10})';
 Bitmap Heap Scan on mydata  (cost=28.01..35.78 rows=2 width=41)
   Recheck Cond: (text ~ '(1998.{0,10})'::text)
   Rows Removed by Index Recheck: 20
   Heap Blocks: exact=723
   ->  Bitmap Index Scan on text_trgm_idx  (cost=0.00..28.01 rows=2 width=0) (actual time=1.136..1.136 rows=2059 loops=1)
         Index Cond: (text ~ '(1998.{0,10})'::text)
 Planning time: 1.980 ms
 Execution time: 554.589 ms
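
One possible way to combine index-assisted filtering with snippet extraction is to keep the WHERE text ~ ... predicate, which is what lets the planner consider text_trgm_idx, and to swap the set-returning regexp_matches() for the scalar substring(... FROM ...). The following is a minimal sketch that assumes only the first match per row is needed. The alias m and the output column name snippet are illustrative, and the query has not been benchmarked against the data above, so any speedup is unverified.

```sql
-- Sketch: filter with the trigram-indexable ~ operator, extract with substring().
-- Assumes the table and index from the question (mydata, text_trgm_idx).
EXPLAIN ANALYZE
SELECT m.title,
       -- returns the first match only (NULL if there is none)
       substring(m.text FROM '1998.{0,10}') AS snippet
FROM   mydata AS m
-- this predicate is the part that can use the GIN trigram index
WHERE  m.text ~ '1998.{0,10}';
```

The regex still runs once for the recheck and once for the extraction, so this reduces per-row overhead rather than eliminating the second evaluation. On PostgreSQL 10 or later, regexp_match() (singular) is another scalar alternative; it returns a text array, or NULL when nothing matches.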

Furthermore, any suggestions on how to get the best performance for regex pattern matching in PostgreSQL, or pointers to further reading, would be appreciated. I'm not constrained to any particular PostgreSQL version.
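
As a starting point for tuning, it may help to see where the time actually goes. In the plans above the bitmap index scan itself finishes in about 1 ms, so most of the runtime is presumably spent fetching heap pages and re-running the regex on candidate rows. A small sketch using the BUFFERS option of EXPLAIN, which reports buffer hits and reads per plan node:

```sql
-- Sketch: same filter as in the question, with buffer statistics added.
-- A high "read" count relative to "hit" would point to disk I/O rather than regex cost.
EXPLAIN (ANALYZE, BUFFERS)
SELECT title
FROM   mydata
WHERE  text ~ '1998.{0,10}';
```

Running the statement twice and comparing the second timing gives a rough idea of how much of the cost comes from a cold cache.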

Source: https://stackoverflow.com/questions/44229592/regex-pattern-matching-with-pg-trgm-trigram-matching
