Spatial query on large table with multiple self joins performing slow

Submitted by 心不动则不痛 on 2019-12-01 06:59:39
Erwin Brandstetter

This query should be considerably faster:

WITH school AS (
   SELECT s.osm_id AS school_id, text 'school' AS type, s.osm_id, s.name, s.way_geo
   FROM   planet_osm_point s
        , LATERAL (
      SELECT  1 FROM planet_osm_point
      WHERE   ST_DWithin(way_geo, s.way_geo, 500, false)
      AND     amenity = 'bar'
      LIMIT   1  -- bar exists -- most selective first if possible
      ) b
        , LATERAL (
      SELECT  1 FROM planet_osm_point
      WHERE   ST_DWithin(way_geo, s.way_geo, 500, false)
      AND     amenity = 'restaurant'
      LIMIT   1  -- restaurant exists
      ) r
   WHERE  s.amenity = 'school'
   )
SELECT * FROM (
   TABLE school  -- schools

   UNION ALL  -- bars
   SELECT s.school_id, 'bar', x.*
   FROM   school s
        , LATERAL (
      SELECT  osm_id, name, way_geo
      FROM    planet_osm_point
      WHERE   ST_DWithin(way_geo, s.way_geo, 500, false)
      AND     amenity = 'bar'
      ) x

   UNION ALL  -- restaurants
   SELECT s.school_id, 'rest.', x.*
   FROM   school s
        , LATERAL (
      SELECT  osm_id, name, way_geo
      FROM    planet_osm_point
      WHERE   ST_DWithin(way_geo, s.way_geo, 500, false)
      AND     amenity = 'restaurant'
      ) x
   ) sub
ORDER BY school_id, (type <> 'school'), type, osm_id;

This is not the same as your original query, but rather what you actually want, as per discussion in comments:

I want a list of schools that have restaurants and bars within 500 meters and I need the coordinates of each school and its corresponding restaurants and bars.

So this query returns a list of those schools, followed by bars and restaurants nearby. Each set of rows is held together by the osm_id of the school in the column school_id.

The query now uses LATERAL joins, so that each ST_DWithin() lookup can make use of the spatial GiST index.
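One way to check whether the spatial index is actually used is to look at the plan of a single LATERAL lookup. This is only a sketch: it reuses the table and column names from the query above, and the LIMIT of 100 is an arbitrary value chosen to keep the test run short.

```sql
-- Note: EXPLAIN ANALYZE really executes the statement.
EXPLAIN (ANALYZE, BUFFERS)
SELECT s.osm_id AS school_id, b.osm_id AS bar_id
FROM   planet_osm_point s
     , LATERAL (
   SELECT osm_id
   FROM   planet_osm_point
   WHERE  ST_DWithin(way_geo, s.way_geo, 500, false)
   AND    amenity = 'bar'
   ) b
WHERE  s.amenity = 'school'
LIMIT  100;  -- arbitrary cap, assumption for a quick test
```

In the output, the inner side of the nested loop should ideally show an index scan or bitmap index scan on idx_planet_osm_point_waygeo rather than a sequential scan.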

TABLE school is just shorthand for SELECT * FROM school.

The expression (type <> 'school') orders the school in each set first, because it evaluates to false for the school row and true for all other rows, and false sorts before true.
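A small standalone sketch of that sort order, using a made-up VALUES list instead of the real table (the three type values mirror the ones in the query above):

```sql
-- boolean sort order in PostgreSQL: false < true
SELECT type
FROM  (VALUES ('bar'), ('school'), ('rest.')) t(type)
ORDER  BY (type <> 'school'), type;
-- returns: school, bar, rest.
```

The second sort key, type, then keeps bars and restaurants grouped after the school.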

The subquery sub in the final SELECT is only needed to order by this expression. A UNION query restricts an attached ORDER BY list to output column names; expressions are not allowed.
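The restriction and the workaround can be reproduced without the real table. The following is a minimal sketch with made-up literal rows; the column names id and type are only for illustration:

```sql
-- This form is rejected, because ORDER BY on a UNION
-- only accepts output column names:
--
--    SELECT 1 AS id, text 'school' AS type
--    UNION ALL
--    SELECT 2, 'bar'
--    ORDER BY (type <> 'school');

-- Wrapping the UNION in a subquery lifts the restriction:
SELECT *
FROM  (
   SELECT 1 AS id, text 'school' AS type
   UNION ALL
   SELECT 2, 'bar'
   ) sub
ORDER  BY (type <> 'school'), id;
```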

I focus on the query you presented for the purpose of this answer - ignoring the extended requirement to filter on any of the other 70 text columns. That's really a design flaw. The search criteria should be concentrated in few columns. Or you'll have to index all 70 columns, and multicolumn indexes like I am going to propose are hardly an option. Still possible though ...

Index

In addition to the existing:

"idx_planet_osm_point_waygeo" gist (way_geo)

If always filtering on the same column, you could create a multicolumn index covering the few columns you are interested in, so index-only scans become possible:

CREATE INDEX planet_osm_point_bar_idx ON planet_osm_point (amenity, name, osm_id);

Postgres 9.5

The upcoming Postgres 9.5 introduces major improvements that happen to address your case exactly:

  • Allow queries to perform accurate distance filtering of bounding-box-indexed objects (polygons, circles) using GiST indexes (Alexander Korotkov, Heikki Linnakangas)

    Previously, a common table expression was required to return a large number of rows ordered by bounding-box distance, and then filtered further with a more accurate non-bounding-box distance calculation.

  • Allow GiST indexes to perform index-only scans (Anastasia Lubennikova, Heikki Linnakangas, Andreas Karlsson)

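That's of particular interest for you. Now you can have a single multicolumn (covering) GiST index. Note that including the scalar columns amenity, name and osm_id in a GiST index requires the additional module btree_gist:

CREATE INDEX planet_osm_point_gist_idx ON planet_osm_point
USING gist (amenity, way_geo, name, osm_id);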

And:

  • Improve bitmap index scan performance (Teodor Sigaev, Tom Lane)

And:

  • Add GROUP BY analysis functions GROUPING SETS, CUBE and ROLLUP (Andrew Gierth, Atri Sharma)

Why? Because ROLLUP would simplify the query I suggested.

The first alpha version was released on July 2, 2015. The expected timeline for the release:

This is the alpha release of version 9.5, indicating that some changes to features are still possible before release. The PostgreSQL Project will release 9.5 beta 1 in August, and then periodically release additional betas as required for testing until the final release in late 2015.

Basics

Of course, be sure not to overlook the basics:

The 3 sub-selects that you use are very inefficient. Write them as LEFT JOIN clauses and the query should be much more efficient:

SELECT
  school.osm_id AS school_osm_id, 
  school.name AS school_name, 
  school.way AS school_way, 
  restaurant.osm_id AS restaurant_osm_id, 
  restaurant.name AS restaurant_name, 
  restaurant.way AS restaurant_way, 
  bar.osm_id AS bar_osm_id, 
  bar.name AS bar_name, 
  bar.way AS bar_way 
FROM planet_osm_point school
LEFT JOIN planet_osm_point restaurant ON restaurant.amenity = 'restaurant' AND
                               ST_DWithin(school.way_geo, restaurant.way_geo, 500, false) 
LEFT JOIN planet_osm_point bar ON bar.amenity = 'bar' AND
                               ST_DWithin(school.way_geo, bar.way_geo, 500, false)
WHERE school.amenity = 'school'
  AND (restaurant.osm_id IS NOT NULL OR bar.osm_id IS NOT NULL);

But this will give too many results if you have multiple restaurants and bars per school. You can simplify the query like this:

SELECT
  school.osm_id AS school_osm_id, 
  school.name AS school_name, 
  school.way AS school_way, 
  a.osm_id AS amenity_osm_id, 
  a.amenity AS amenity_type,
  a.name AS amenity_name, 
  a.way AS amenity_way
FROM planet_osm_point school
JOIN planet_osm_point a ON ST_DWithin(school.way_geo, a.way_geo, 500, false) 
WHERE school.amenity = 'school'
  AND a.amenity IN ('bar', 'restaurant');

This will give every bar and restaurant for each school. Schools without either restaurant or bar within 500m are not listed.
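The query above lists schools that have a bar or a restaurant nearby. If only schools with both are wanted, one possible approach is to aggregate per school and count the distinct amenity types. This is a sketch, assuming osm_id uniquely identifies a school:

```sql
-- schools with at least one bar AND at least one restaurant within 500 m
SELECT school.osm_id AS school_osm_id
FROM   planet_osm_point school
JOIN   planet_osm_point a ON ST_DWithin(school.way_geo, a.way_geo, 500, false)
WHERE  school.amenity = 'school'
AND    a.amenity IN ('bar', 'restaurant')
GROUP  BY school.osm_id
HAVING count(DISTINCT a.amenity) = 2;  -- both amenity types present
```

The resulting list of osm_id values could then be joined back to the simplified query to restrict its output to those schools.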

Does it make any difference if you use explicit joins?

SELECT a.id as a_id, a.name as a_name, a.geog as a_geog,
       b.id as b_id, b.name as b_name, b.geog as b_geog,
       c.id as c_id, c.name as c_name, c.geog as c_geog
FROM table1 a
JOIN table1 b ON b.type = 'B' AND ST_DWithin(a.geog, b.geog, 100)
JOIN table1 c ON c.type = 'C' AND ST_DWithin(a.geog, c.geog, 100)
WHERE a.type = 'A';

Try this with inner join syntax and compare the results; there should be no duplicates. My guess is that it should take a third of the time of the original query, or less:

select a.id as a_id, a.name as a_name, a.geog as a_geo,
       b.id as b_id, b.name as b_name, b.geog as b_geo,
       c.id as c_id, c.name as c_name, c.geog as c_geo
from table1 as a
INNER JOIN table1 as b on b.type='B'
INNER JOIN table1 as c on c.type='C'
WHERE a.type='A' and
     (ST_DWithin(a.geog, b.geog, 100) and ST_DWithin(a.geog, c.geog, 100));