Finding similar strings with PostgreSQL quickly

徘徊边缘 提交于 2019-11-26 18:42:16

In Postgres 9.6 the configuration parameter pg_trgm.similarity_threshold replaced the functions set_limit() and show_limit(). The functions are deprecated but still work.

Performance of GIN and GiST indexes also improved in multiple ways since Postgres 9.1.


Use set_limit() and the % operator instead. Both are provided by the pg_trgm module.

The way you have it, the similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Try instead:

SET pg_trgm.similarity_threshold = 0.8; -- Postgres 9.6 or later
-- SELECT set_limit(0.8);               -- for older versions

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;

Faster by orders of magnitude, but still slow.

You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).


As to your subsidiary question:

WHERE ... sim > 0.8

Does not work because you cannot refer to output columns in WHERE or HAVING clauses. That's according to the (a bit confusing, granted) SQL standard - which is handled rather loosely by certain other RDBMS.

On the other hand:

ORDER BY sim DESC

Works because output columns can be used in GROUP BY and ORDER BY. Details:

Test case

I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE (best of 5).

CREATE TEMP table t AS 
SELECT some_col AS name FROM some_table LIMIT 1000;  -- real life test strings

First round of tests with GIN index:

CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops);  -- round1: with GIN index

Second round of tests with GIST index:

DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);

New query:

SELECT set_limit(0.8);

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   t n1
JOIN   t n2 ON n1.name <> n2.name
           AND n1.name % n2.name
ORDER  BY sim DESC;

GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms

Old query:

SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM   t n1, t n2
WHERE  n1.name != n2.name
AND    similarity(n1.name, n2.name) > 0.8
ORDER  BY sim DESC;

GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms

Otherwise identical results. Advice is good. And this is for just 1000 rows!

GIN or GiST?

GIN often provides superior read performance:

But not in this particular case:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!