Question
I am working on queries against a large table in Postgres 9.3.9. It is a spatial dataset with a spatial index. Say I need to find three types of objects: A, B and C. The criterion is that B and C are both within a certain distance of A, say 500 meters.
My query looks like this:
select
school.osm_id as school_osm_id,
school.name as school_name,
school.way as school_way,
restaurant.osm_id as restaurant_osm_id,
restaurant.name as restaurant_name,
restaurant.way as restaurant_way,
bar.osm_id as bar_osm_id,
bar.name as bar_name,
bar.way as bar_way
from (
select osm_id, name, amenity, way, way_geo
from planet_osm_point
where amenity = 'school') as school,
(select osm_id, name, amenity, way, way_geo
from planet_osm_point
where amenity = 'restaurant') as restaurant,
(select osm_id, name, amenity, way, way_geo
from planet_osm_point
where amenity = 'bar') as bar
where ST_DWithin(school.way_geo, restaurant.way_geo, 500, false)
and ST_DWithin(school.way_geo, bar.way_geo, 500, false);
This query gives me what I want, but it takes a really long time to execute, about 13 seconds. I'm wondering if there is another way to write the query to make it more efficient.
Query plan:
Nested Loop (cost=74.43..28618.65 rows=1 width=177) (actual time=33.513..11235.212 rows=10591 loops=1)
Buffers: shared hit=530967 read=8733
-> Nested Loop (cost=46.52..28586.46 rows=1 width=174) (actual time=31.998..9595.212 rows=4235 loops=1)
Buffers: shared hit=389863 read=8707
-> Bitmap Heap Scan on planet_osm_point (cost=18.61..2897.83 rows=798 width=115) (actual time=7.862..150.607 rows=8811 loops=1)
Recheck Cond: (amenity = 'school'::text)
Buffers: shared hit=859 read=5204
-> Bitmap Index Scan on idx_planet_osm_point_amenity (cost=0.00..18.41 rows=798 width=0) (actual time=5.416..5.416 rows=8811 loops=1)
Index Cond: (amenity = 'school'::text)
Buffers: shared hit=3 read=24
-> Bitmap Heap Scan on planet_osm_point planet_osm_point_1 (cost=27.91..32.18 rows=1 width=115) (actual time=1.064..1.069 rows=0 loops=8811)
Recheck Cond: ((way_geo && _st_expand(planet_osm_point.way_geo, 500::double precision)) AND (amenity = 'restaurant'::text))
Filter: ((planet_osm_point.way_geo && _st_expand(way_geo, 500::double precision)) AND _st_dwithin(planet_osm_point.way_geo, way_geo, 500::double precision, false))
Rows Removed by Filter: 0
Buffers: shared hit=389004 read=3503
-> BitmapAnd (cost=27.91..27.91 rows=1 width=0) (actual time=1.058..1.058 rows=0 loops=8811)
Buffers: shared hit=384528 read=2841
-> Bitmap Index Scan on idx_planet_osm_point_waygeo (cost=0.00..9.05 rows=137 width=0) (actual time=0.193..0.193 rows=64 loops=8811)
Index Cond: (way_geo && _st_expand(planet_osm_point.way_geo, 500::double precision))
Buffers: shared hit=146631 read=2841
-> Bitmap Index Scan on idx_planet_osm_point_amenity (cost=0.00..18.41 rows=798 width=0) (actual time=0.843..0.843 rows=6291 loops=8811)
Index Cond: (amenity = 'restaurant'::text)
Buffers: shared hit=237897
-> Bitmap Heap Scan on planet_osm_point planet_osm_point_2 (cost=27.91..32.18 rows=1 width=115) (actual time=0.375..0.383 rows=3 loops=4235)
Recheck Cond: ((way_geo && _st_expand(planet_osm_point.way_geo, 500::double precision)) AND (amenity = 'bar'::text))
Filter: ((planet_osm_point.way_geo && _st_expand(way_geo, 500::double precision)) AND _st_dwithin(planet_osm_point.way_geo, way_geo, 500::double precision, false))
Rows Removed by Filter: 1
Buffers: shared hit=141104 read=26
-> BitmapAnd (cost=27.91..27.91 rows=1 width=0) (actual time=0.368..0.368 rows=0 loops=4235)
Buffers: shared hit=127019
-> Bitmap Index Scan on idx_planet_osm_point_waygeo (cost=0.00..9.05 rows=137 width=0) (actual time=0.252..0.252 rows=363 loops=4235)
Index Cond: (way_geo && _st_expand(planet_osm_point.way_geo, 500::double precision))
Buffers: shared hit=101609
-> Bitmap Index Scan on idx_planet_osm_point_amenity (cost=0.00..18.41 rows=798 width=0) (actual time=0.104..0.104 rows=779 loops=4235)
Index Cond: (amenity = 'bar'::text)
Buffers: shared hit=25410
Total runtime: 11238.605 ms
I'm only using one table at the moment with 1,372,711 rows. It has 73 columns:
Column | Type | Modifiers
--------------------+----------------------+---------------------------
osm_id | bigint |
access | text |
addr:housename | text |
addr:housenumber | text |
addr:interpolation | text |
admin_level | text |
aerialway | text |
aeroway | text |
amenity | text |
area | text |
barrier | text |
bicycle | text |
brand | text |
bridge | text |
boundary | text |
building | text |
capital | text |
construction | text |
covered | text |
culvert | text |
cutting | text |
denomination | text |
disused | text |
ele | text |
embankment | text |
foot | text |
generator:source | text |
harbour | text |
highway | text |
historic | text |
horse | text |
intermittent | text |
junction | text |
landuse | text |
layer | text |
leisure | text |
lock | text |
man_made | text |
military | text |
motorcar | text |
name | text |
natural | text |
office | text |
oneway | text |
operator | text |
place | text |
poi | text |
population | text |
power | text |
power_source | text |
public_transport | text |
railway | text |
ref | text |
religion | text |
route | text |
service | text |
shop | text |
sport | text |
surface | text |
toll | text |
tourism | text |
tower:type | text |
tunnel | text |
water | text |
waterway | text |
wetland | text |
width | text |
wood | text |
z_order | integer |
tags | hstore |
way | geometry(Point,4326) |
way_geo | geography |
gid | integer | not null default nextval('...
Indexes:
"planet_osm_point_pkey1" PRIMARY KEY, btree (gid)
"idx_planet_osm_point_amenity" btree (amenity)
"idx_planet_osm_point_waygeo" gist (way_geo)
"planet_osm_point_index" gist (way)
"planet_osm_point_pkey" btree (osm_id)
There are 8811, 6291 and 779 rows with amenity school, restaurant and bar, respectively.
Answer 1:
This query should go a long way (be much faster):
WITH school AS (
SELECT s.osm_id AS school_id, text 'school' AS type, s.osm_id, s.name, s.way_geo
FROM planet_osm_point s
, LATERAL (
SELECT 1 FROM planet_osm_point
WHERE ST_DWithin(way_geo, s.way_geo, 500, false)
AND amenity = 'bar'
LIMIT 1 -- bar exists -- most selective first if possible
) b
, LATERAL (
SELECT 1 FROM planet_osm_point
WHERE ST_DWithin(way_geo, s.way_geo, 500, false)
AND amenity = 'restaurant'
LIMIT 1 -- restaurant exists
) r
WHERE s.amenity = 'school'
)
SELECT * FROM (
TABLE school -- schools
UNION ALL -- bars
SELECT s.school_id, 'bar', x.*
FROM school s
, LATERAL (
SELECT osm_id, name, way_geo
FROM planet_osm_point
WHERE ST_DWithin(way_geo, s.way_geo, 500, false)
AND amenity = 'bar'
) x
UNION ALL -- restaurants
SELECT s.school_id, 'rest.', x.*
FROM school s
, LATERAL (
SELECT osm_id, name, way_geo
FROM planet_osm_point
WHERE ST_DWithin(way_geo, s.way_geo, 500, false)
AND amenity = 'restaurant'
) x
) sub
ORDER BY school_id, (type <> 'school'), type, osm_id;
This is not the same as your original query, but rather what you actually want, as per discussion in comments:
I want a list of schools that have restaurants and bars within 500 meters and I need the coordinates of each school and its corresponding restaurants and bars.
So this query returns a list of those schools, followed by bars and restaurants nearby. Each set of rows is held together by the osm_id of the school in the column school_id.
It now uses LATERAL joins to make use of the spatial GiST index.
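The two existence checks inside the CTE can also be written with EXISTS, which some readers find easier to follow. This is a sketch of an alternative formulation, not the query above. Whether the planner produces an equally good plan on 9.3 is an assumption to verify with EXPLAIN ANALYZE:

```sql
-- Sketch: schools that have at least one bar and one restaurant within 500 m.
-- Table and column names are taken from the question.
SELECT s.osm_id, s.name, s.way_geo
FROM   planet_osm_point s
WHERE  s.amenity = 'school'
AND    EXISTS (
   SELECT 1
   FROM   planet_osm_point b
   WHERE  b.amenity = 'bar'
   AND    ST_DWithin(b.way_geo, s.way_geo, 500, false)
   )
AND    EXISTS (
   SELECT 1
   FROM   planet_osm_point r
   WHERE  r.amenity = 'restaurant'
   AND    ST_DWithin(r.way_geo, s.way_geo, 500, false)
   );
```

Like LIMIT 1 in the LATERAL subqueries, EXISTS can stop at the first matching row, so each school is returned at most once.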
TABLE school is just shorthand for SELECT * FROM school:
- Is there a shortcut for SELECT * FROM in psql?
The expression (type <> 'school') orders the school in each set first, because boolean false sorts before true:
- SQL select query order by day and month
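The sort behavior can be tried in isolation. A small self-contained sketch with made-up rows in a VALUES list (no PostGIS needed):

```sql
-- (type <> 'school') is false for the school row and true for the others;
-- false sorts before true, so the school comes first,
-- followed by the remaining rows ordered by type.
SELECT type
FROM  (VALUES ('rest.'), ('school'), ('bar')) AS t(type)
ORDER  BY (type <> 'school'), type;
```

This returns school, then bar, then rest.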
The subquery sub in the final SELECT is only needed to order by this expression. A UNION query limits an attached ORDER BY list to only columns, no expressions.
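A reduced sketch of that restriction, using literal rows instead of the real table. The statement in the comment is expected to raise an error, while the wrapped form works:

```sql
-- Not allowed: an expression in the ORDER BY attached directly to a UNION.
-- SELECT text 'school' AS type UNION ALL SELECT 'bar'
-- ORDER  BY (type <> 'school');

-- Allowed: move the UNION into a subquery and sort in the outer SELECT.
SELECT *
FROM  (
   SELECT text 'school' AS type
   UNION ALL
   SELECT 'bar'
   ) sub
ORDER  BY (type <> 'school'), type;
```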
For the purpose of this answer I focus on the query you presented, ignoring the extended requirement to filter on any of the other 70 text columns. That's really a design flaw: the search criteria should be concentrated in a few columns. Otherwise you would have to index all 70 columns, and multicolumn indexes like the one I am going to propose are hardly an option. Still possible though ...
Index
In addition to the existing:
"idx_planet_osm_point_waygeo" gist (way_geo)
If you always filter on the same column, you could create a multicolumn index covering the few columns you are interested in, so index-only scans become possible:
CREATE INDEX planet_osm_point_bar_idx ON planet_osm_point (amenity, name, osm_id);
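Index-only scans rely on the visibility map being reasonably current, so it may help to vacuum first and then check the plan. A sketch, assuming the index above has been created. The SELECT is only an illustrative probe, not part of the suggested query:

```sql
-- Refresh the visibility map and planner statistics.
VACUUM ANALYZE planet_osm_point;

-- Probe query that touches only columns contained in the index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT osm_id, name
FROM   planet_osm_point
WHERE  amenity = 'bar';
```

If the index is used as intended, the plan should contain an "Index Only Scan" node. A high "Heap Fetches" count there suggests that many pages are not yet marked all-visible.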
Postgres 9.5
The upcoming Postgres 9.5 introduces major improvements that happen to address your case exactly:
Allow queries to perform accurate distance filtering of bounding-box-indexed objects (polygons, circles) using GiST indexes (Alexander Korotkov, Heikki Linnakangas)
Previously, a common table expression was required to return a large number of rows ordered by bounding-box distance, and then filtered further with a more accurate non-bounding-box distance calculation.
Allow GiST indexes to perform index-only scans (Anastasia Lubennikova, Heikki Linnakangas, Andreas Karlsson)
That's of particular interest for you. Now you can have a single multicolumn (covering) GiST index:
CREATE INDEX planet_osm_point_gist_idx ON planet_osm_point
USING gist (amenity, way_geo, name, osm_id);
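One prerequisite worth noting for a GiST index over plain text and bigint columns: core PostgreSQL ships no GiST operator classes for those types, and the additional module btree_gist provides them. A minimal sketch, assuming the module is available on the server and you have the privileges to install it:

```sql
-- btree_gist adds GiST operator classes for scalar types such as text and bigint,
-- which a GiST index on (amenity, way_geo, name, osm_id) relies on.
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Check that the extension is installed.
SELECT extname, extversion
FROM   pg_extension
WHERE  extname = 'btree_gist';
```

Whether the planner then actually chooses an index-only scan also depends on the operator class of each indexed column, so it is worth confirming with EXPLAIN.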
And:
- Improve bitmap index scan performance (Teodor Sigaev, Tom Lane)
And:
- Add GROUP BY analysis functions GROUPING SETS, CUBE and ROLLUP (Andrew Gierth, Atri Sharma)
Why? Because ROLLUP would simplify the query I suggested. Related answer:
- Grouping() equivalent in PostgreSQL?
The first alpha version was released on July 2, 2015. The expected timeline for the release:
This is the alpha release of version 9.5, indicating that some changes to features are still possible before release. The PostgreSQL Project will release 9.5 beta 1 in August, and then periodically release additional betas as required for testing until the final release in late 2015.
Basics
Of course, be sure not to overlook the basics:
- Slow Query Questions page on the PostgreSQL Wiki
Answer 2:
The 3 sub-selects that you use are very inefficient. Write them as LEFT JOIN clauses and the query should be much more efficient:
SELECT
school.osm_id AS school_osm_id,
school.name AS school_name,
school.way AS school_way,
restaurant.osm_id AS restaurant_osm_id,
restaurant.name AS restaurant_name,
restaurant.way AS restaurant_way,
bar.osm_id AS bar_osm_id,
bar.name AS bar_name,
bar.way AS bar_way
FROM planet_osm_point school
LEFT JOIN planet_osm_point restaurant ON restaurant.amenity = 'restaurant' AND
ST_DWithin(school.way_geo, restaurant.way_geo, 500, false)
LEFT JOIN planet_osm_point bar ON bar.amenity = 'bar' AND
ST_DWithin(school.way_geo, bar.way_geo, 500, false)
WHERE school.amenity = 'school'
AND (restaurant.osm_id IS NOT NULL OR bar.osm_id IS NOT NULL);
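But this will give too many results if you have multiple restaurants and bars per school: each school row is repeated for every restaurant-bar combination, so 3 restaurants and 2 bars near one school produce 6 rows. You can simplify the query like this: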
SELECT
school.osm_id AS school_osm_id,
school.name AS school_name,
school.way AS school_way,
a.osm_id AS amenity_osm_id,
a.amenity AS amenity_type,
a.name AS amenity_name,
a.way AS amenity_way
FROM planet_osm_point school
JOIN planet_osm_point a ON ST_DWithin(school.way_geo, a.way_geo, 500, false)
WHERE school.amenity = 'school'
AND a.amenity IN ('bar', 'restaurant');
This returns every bar and restaurant for each school, one per row. Schools with neither a restaurant nor a bar within 500 m are not listed.
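Note that the query above also lists schools that have only bars or only restaurants nearby. If only schools with both are wanted, as in the question, one possible sketch is to group per school and require both amenity types. This assumes osm_id identifies a school; the table's gid primary key could be used instead:

```sql
-- Sketch: ids of schools with at least one bar AND one restaurant within 500 m.
SELECT school.osm_id AS school_osm_id
FROM   planet_osm_point school
JOIN   planet_osm_point a ON ST_DWithin(school.way_geo, a.way_geo, 500, false)
WHERE  school.amenity = 'school'
AND    a.amenity IN ('bar', 'restaurant')
GROUP  BY school.osm_id
HAVING count(DISTINCT a.amenity) = 2;  -- both amenity types are present
```

The resulting ids could then be used to filter the simplified query, for example with school.osm_id IN (...).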
Answer 3:
Does it make any difference if you use explicit joins?
SELECT a.id as a_id, a.name as a_name, a.geog as a_geog,
b.id as b_id, b.name as b_name, b.geog as b_geog,
c.id as c_id, c.name as c_name, c.geog as c_geog
FROM table1 a
JOIN table1 b ON b.type = 'B' AND ST_DWithin(a.geog, b.geog, 100)
JOIN table1 c ON c.type = 'C' AND ST_DWithin(a.geog, c.geog, 100)
WHERE a.type = 'A';
Answer 4:
Try this with inner join syntax and compare the results; there should be no duplicates. My guess is that it should take a third of the time of the original query, or better:
select a.id as a_id, a.name as a_name, a.geog as a_geo,
b.id as b_id, b.name as b_name, b.geog as b_geo,
c.id as c_id, c.name as c_name, c.geog as c_geo
from table1 as a
INNER JOIN table1 as b on b.type='B'
INNER JOIN table1 as c on c.type='C'
WHERE a.type='A' and
(ST_DWithin(a.geog, b.geog, 100) and ST_DWithin(a.geog, c.geog, 100));
Source: https://stackoverflow.com/questions/31466837/spatial-query-on-large-table-with-multiple-self-joins-performing-slow