Why is the query planner unable to transform a correlated subquery?

Submitted by 老子叫甜甜 on 2019-12-07 13:35:40

Question


In "How does PostgreSQL approach a 1 + n query?", I learned that a correlated subquery can be rewritten as a left join:

select   film_id, title,
         (
           select     array_agg(first_name)
           from       actor
           inner join film_actor using(actor_id)
           where      film_actor.film_id = film.film_id
         ) as actors
from     film
order by title;

to

select   f.film_id, f.title, array_agg(a.first_name)
from     film f
   left join film_actor fa using(film_id)
   left join actor      a  using(actor_id)
group by f.film_id
order by f.title;

Both queries return the same results, but the second query performs better.

This makes me wonder: why is the query planner unable to do such transformations by itself?

I can see why not all correlated subqueries could be transformed to a join, but I don't see any issues with this particular query.

Update: performance

I compared the performance as follows. I ran the first query in 2 consecutive loops of 100 executions each, then did the same for the second query. In both cases I ignored the first loop, treating it as a warm-up.

I get 16 seconds for 100 executions of the first query and 11 seconds for 100 executions of the second query.

The EXPLAIN ANALYZE outputs are as follows:

correlated subquery:

 Index Scan using idx_title on film  (cost=0.28..24949.50 rows=1000 width=51) (actual time=0.690..74.828 rows=1000 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=24.84..24.85 rows=1 width=32) (actual time=0.068..0.068 rows=1 loops=1000)
       ->  Hash Join  (cost=10.82..24.82 rows=5 width=6) (actual time=0.034..0.055 rows=5 loops=1000)
         Hash Cond: (film_actor.actor_id = actor.actor_id)
         ->  Bitmap Heap Scan on film_actor  (cost=4.32..18.26 rows=5 width=2) (actual time=0.025..0.040 rows=5 loops=1000)
               Recheck Cond: (film_id = film.film_id)
               Heap Blocks: exact=5075
               ->  Bitmap Index Scan on idx_fk_film_id  (cost=0.00..4.32 rows=5 width=0) (actual time=0.015..0.015 rows=5 loops=1000)
                 Index Cond: (film_id = film.film_id)
         ->  Hash  (cost=4.00..4.00 rows=200 width=10) (actual time=0.338..0.338 rows=200 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 17kB
               ->  Seq Scan on actor  (cost=0.00..4.00 rows=200 width=10) (actual time=0.021..0.133 rows=200 loops=1)
 Planning time: 1.277 ms
 Execution time: 75.525 ms

join:

 Sort  (cost=748.60..751.10 rows=1000 width=51) (actual time=35.865..36.060 rows=1000 loops=1)
   Sort Key: f.title
   Sort Method: quicksort  Memory: 199kB
   ->  GroupAggregate  (cost=645.31..698.78 rows=1000 width=51) (actual time=23.953..34.204 rows=1000 loops=1)
     Group Key: f.film_id
     ->  Sort  (cost=645.31..658.97 rows=5462 width=25) (actual time=23.910..25.210 rows=5465 loops=1)
           Sort Key: f.film_id
           Sort Method: quicksort  Memory: 619kB
           ->  Hash Left Join  (cost=84.00..306.25 rows=5462 width=25) (actual time=2.098..16.237 rows=5465 loops=1)
             Hash Cond: (fa.actor_id = a.actor_id)
             ->  Hash Right Join  (cost=77.50..231.03 rows=5462 width=21) (actual time=1.786..10.636 rows=5465 loops=1)
               Hash Cond: (fa.film_id = f.film_id)
               ->  Seq Scan on film_actor fa  (cost=0.00..84.62 rows=5462 width=4) (actual time=0.018..2.221 rows=5462 loops=1)
               ->  Hash  (cost=65.00..65.00 rows=1000 width=19) (actual time=1.753..1.753 rows=1000 loops=1)
                 Buckets: 1024  Batches: 1  Memory Usage: 59kB
                 ->  Seq Scan on film f  (cost=0.00..65.00 rows=1000 width=19) (actual time=0.029..0.819 rows=1000 loops=1)
             ->  Hash  (cost=4.00..4.00 rows=200 width=10) (actual time=0.286..0.286 rows=200 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 17kB
               ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=10) (actual time=0.016..0.114 rows=200 loops=1)
 Planning time: 1.648 ms
 Execution time: 36.599 ms

Answer 1:


This is too much for a comment.

The rewrite of your Correlated Subquery should be like this:

select film.film_id, title, a.actors
from   film
left join
  (         
           select     film_actor.film_id, array_agg(first_name) as actors
           from       actor
           inner join film_actor using(actor_id)
           group by   film_actor.film_id
  ) as a
on a.film_id = film.film_id
order by title;

Regarding performance, scalar correlated subqueries simply seem to be hard for optimizers. I wouldn't expect them to perform the same as or better than a manual rewrite.
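To check that the derived-table rewrite above behaves like the original correlated subquery, here is a minimal sketch on toy data. It runs on SQLite through Python's sqlite3 module rather than on PostgreSQL, so group_concat stands in for array_agg. The three-table schema and its rows are made up for illustration, and film_id is qualified as film.film_id because both the film table and the derived table expose a column of that name.

```python
import sqlite3

# Hypothetical toy data; film 2 deliberately has no actors.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table film (film_id integer primary key, title text);
    create table actor (actor_id integer primary key, first_name text);
    create table film_actor (actor_id integer, film_id integer);

    insert into film values (1, 'Alpha'), (2, 'Beta');
    insert into actor values (10, 'Ann'), (20, 'Bob');
    insert into film_actor values (10, 1), (20, 1);
""")

# Original form: scalar correlated subquery, evaluated once per film row.
subquery_rows = conn.execute("""
    select film_id, title,
           (
             select     group_concat(first_name)
             from       actor
             inner join film_actor using(actor_id)
             where      film_actor.film_id = film.film_id
           ) as actors
    from   film
    order by title
""").fetchall()

# Rewrite: aggregate once per film_id in a derived table, then left join.
derived_rows = conn.execute("""
    select film.film_id, title, a.actors
    from   film
    left join
      (
        select     film_actor.film_id, group_concat(first_name) as actors
        from       actor
        inner join film_actor using(actor_id)
        group by   film_actor.film_id
      ) as a
    on a.film_id = film.film_id
    order by title
""").fetchall()


def normalize(rows):
    # group_concat does not guarantee element order, so compare sorted names.
    return [
        (film_id, title, sorted(actors.split(",")) if actors else None)
        for film_id, title, actors in rows
    ]


print(normalize(subquery_rows))
print(normalize(derived_rows))
```

On this data both forms yield one row per film, and the film without actors gets NULL in the actors column in both. This is only a check on one small data set, not a proof of equivalence.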




Answer 2:


I'm a little surprised that the second one performs better. For instance, the second one should get a syntax error, because the order by is before the group by -- but I get your point.

But the answer to your question is that although SQL is a declarative language rather than a procedural language, the structure of the query can -- for some databases -- affect the execution plan. If you have looked at the EXPLAIN output, then this is clearly the case for these two queries.

The more important answer is that although the queries look the same, they are not semantically equal. In particular, if film.film_id is not unique, they return different answers.
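The non-unique film_id case can be made concrete with a small sketch. It uses SQLite through Python's sqlite3 module instead of PostgreSQL, with group_concat standing in for array_agg, and a made-up film table that deliberately has no key on film_id. SQLite also accepts the ungrouped title column in the join query; that leniency is a property of SQLite, not something the original queries rely on.

```python
import sqlite3

# Hypothetical toy data: two film rows share film_id 1 (no uniqueness constraint).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table film (film_id integer, title text);
    create table actor (actor_id integer primary key, first_name text);
    create table film_actor (actor_id integer, film_id integer);

    insert into film values (1, 'Alpha'), (1, 'Alpha');
    insert into actor values (10, 'Ann'), (20, 'Bob');
    insert into film_actor values (10, 1), (20, 1);
""")

# Correlated subquery: evaluated per film row, so every film row is kept.
correlated = conn.execute("""
    select film_id, title,
           (
             select     group_concat(first_name)
             from       actor
             inner join film_actor using(actor_id)
             where      film_actor.film_id = film.film_id
           ) as actors
    from   film
    order by title
""").fetchall()

# Join + group by: duplicate film rows collapse into one group, and each
# actor is joined once per duplicate film row before aggregation.
joined = conn.execute("""
    select   f.film_id, f.title, group_concat(a.first_name)
    from     film f
       left join film_actor fa using(film_id)
       left join actor      a  using(actor_id)
    group by f.film_id
    order by f.title
""").fetchall()

print(len(correlated), "rows from the correlated subquery")
print(len(joined), "row(s) from the join")
print(joined[0][2].count(",") + 1, "names aggregated by the join")
```

With duplicate film_id values, the correlated subquery returns two rows with two actors each, while the join returns a single row that aggregates four names. A planner that rewrote one form into the other would therefore first have to prove that film_id is unique.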



Source: https://stackoverflow.com/questions/50434001/why-is-the-query-planner-unable-to-transform-a-correlated-subquery
