Question
I have 300 million addresses in my PostgreSQL 9.3 database and I want to use pg_trgm to fuzzy-search the rows. The end goal is a search function similar to the one in Google Maps.
When I use pg_trgm to search these addresses, it takes about 30 seconds to get results. Many rows match the default similarity threshold of 0.3, but I only need about 5 or 10 results. I created a trigram GiST index:
CREATE INDEX addresses_trgm_index ON addresses USING gist (address gist_trgm_ops);
This is my query:
SELECT address, similarity(address, '981 maun st') AS sml
FROM addresses
WHERE address % '981 maun st'
ORDER BY sml DESC
LIMIT 10;
The test table in the production environment has been removed, so I'm showing the EXPLAIN ANALYZE output from my test environment, which has about 7 million rows and takes about 1.6 seconds to return results. With 300 million rows it takes more than 30 seconds.
ebdb=> explain analyse select address, similarity(address, '781 maun st') as sml from addresses where address % '781 maun st' order by sml desc limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Limit (cost=7615.83..7615.86 rows=10 width=16) (actual time=1661.004..1661.010 rows=10 loops=1)
-> Sort (cost=7615.83..7634.00 rows=7268 width=16) (actual time=1661.003..1661.005 rows=10 loops=1)
Sort Key: (similarity((address)::text, '781 maun st'::text))
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using addresses_trgm_index on addresses (cost=0.41..7458.78 rows=7268 width=16) (actual time=0.659..1656.386 rows=5241 loops=1)
Index Cond: ((address)::text % '781 maun st'::text)
Total runtime: 1661.066 ms
(7 rows)
Is there a good way to improve the performance or is it a good plan to do table partitioning?
Answer 1:
PostgreSQL 9.3 ... Is there a good way to improve the performance or is it a good plan to do table partitioning?
Table partitioning will not help at all.
But yes, there is a good way: upgrade to a current version of Postgres. There have been many improvements for GiST indexes, for the pg_trgm module in particular, and for big data in general. It should be substantially faster with Postgres 9.6 or the upcoming Postgres 10 (currently in beta).
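As an aside, if the workload were only the `%` filter (without the nearest-neighbor ordering discussed below), a GIN trigram index could be worth testing; the manual's rule of thumb is that GIN is faster to search than GiST but slower to build and update. Note the caveat in the comment: GIN cannot serve the `<->` distance ordering, which requires GiST. A sketch:

```sql
-- GIN trigram index: typically faster than GiST for plain % lookups,
-- but it cannot support ORDER BY address <-> '...' (KNN needs GiST).
CREATE INDEX addresses_trgm_gin_index ON addresses USING gin (address gin_trgm_ops);
```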
You "nearest neighbor" looks correct but for a small LIMIT
use this equivalent query instead:
SELECT address, similarity(address, '981 maun st') AS sml
FROM addresses
WHERE address % '981 maun st'
ORDER BY address <-> '981 maun st'
LIMIT 10;
Quoting the manual:
It will usually beat the first formulation when only a small number of the closest matches is wanted.
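Since the question mentions that many rows clear the default 0.3 threshold, raising it shrinks the candidate set the index scan has to return before the LIMIT applies. A sketch using pg_trgm's `set_limit()` function (on Postgres 9.6+, the `pg_trgm.similarity_threshold` setting is the preferred equivalent); 0.5 is an illustrative value to tune against your data:

```sql
-- Raise the similarity cutoff used by the % operator (default is 0.3),
-- so fewer candidate rows match before LIMIT kicks in.
SELECT set_limit(0.5);  -- affects the current session

-- On PostgreSQL 9.6+ the GUC form is equivalent:
-- SET pg_trgm.similarity_threshold = 0.5;

SELECT address, similarity(address, '981 maun st') AS sml
FROM addresses
WHERE address % '981 maun st'
ORDER BY address <-> '981 maun st'
LIMIT 10;
```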
Source: https://stackoverflow.com/questions/44773387/search-in-300-million-addresses-with-pg-trgm