问题
In How does PostgreSQL approach a 1 + n query?, I learned that a correlated subquery can be rewritten as a left join:
select film_id, title,
(
select array_agg(first_name)
from actor
inner join film_actor using(actor_id)
where film_actor.film_id = film.film_id
) as actors
from film
order by title;
to
select f.film_id, f.title, array_agg(a.first_name)
from film f
left join film_actor fa using(film_id)
left join actor a using(actor_id)
group by f.film_id
order by f.title;
Bot queries return the same results, but the second query performs better.
This makes me wonder: why is the query planner unable to do such transformations by itself?
I can see why not all correlated subqueries could be transformed to a join, but I don't see any issues with this particular query.
update performance
I tried to compare the performance as following. I executed 2 consecutive loops of 100 times the first query, followed by 2 consecutive loops of 100 times the second query. I ignored the first loop in both cases, as I considered that a warm-up loop.
I get 16 seconds for 100x the first query and 11 seconds for 100x the second query.
The explains are as following:
correlated subquery:
Index Scan using idx_title on film (cost=0.28..24949.50 rows=1000 width=51) (actual time=0.690..74.828 rows=1000 loops=1)
SubPlan 1
-> Aggregate (cost=24.84..24.85 rows=1 width=32) (actual time=0.068..0.068 rows=1 loops=1000)
-> Hash Join (cost=10.82..24.82 rows=5 width=6) (actual time=0.034..0.055 rows=5 loops=1000)
Hash Cond: (film_actor.actor_id = actor.actor_id)
-> Bitmap Heap Scan on film_actor (cost=4.32..18.26 rows=5 width=2) (actual time=0.025..0.040 rows=5 loops=1000)
Recheck Cond: (film_id = film.film_id)
Heap Blocks: exact=5075
-> Bitmap Index Scan on idx_fk_film_id (cost=0.00..4.32 rows=5 width=0) (actual time=0.015..0.015 rows=5 loops=1000)
Index Cond: (film_id = film.film_id)
-> Hash (cost=4.00..4.00 rows=200 width=10) (actual time=0.338..0.338 rows=200 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 17kB
-> Seq Scan on actor (cost=0.00..4.00 rows=200 width=10) (actual time=0.021..0.133 rows=200 loops=1)
Planning time: 1.277 ms
Execution time: 75.525 ms
join:
Sort (cost=748.60..751.10 rows=1000 width=51) (actual time=35.865..36.060 rows=1000 loops=1)
Sort Key: f.title
Sort Method: quicksort Memory: 199kB
-> GroupAggregate (cost=645.31..698.78 rows=1000 width=51) (actual time=23.953..34.204 rows=1000 loops=1)
Group Key: f.film_id
-> Sort (cost=645.31..658.97 rows=5462 width=25) (actual time=23.910..25.210 rows=5465 loops=1)
Sort Key: f.film_id
Sort Method: quicksort Memory: 619kB
-> Hash Left Join (cost=84.00..306.25 rows=5462 width=25) (actual time=2.098..16.237 rows=5465 loops=1)
Hash Cond: (fa.actor_id = a.actor_id)
-> Hash Right Join (cost=77.50..231.03 rows=5462 width=21) (actual time=1.786..10.636 rows=5465 loops=1)
Hash Cond: (fa.film_id = f.film_id)
-> Seq Scan on film_actor fa (cost=0.00..84.62 rows=5462 width=4) (actual time=0.018..2.221 rows=5462 loops=1)
-> Hash (cost=65.00..65.00 rows=1000 width=19) (actual time=1.753..1.753 rows=1000 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 59kB
-> Seq Scan on film f (cost=0.00..65.00 rows=1000 width=19) (actual time=0.029..0.819 rows=1000 loops=1)
-> Hash (cost=4.00..4.00 rows=200 width=10) (actual time=0.286..0.286 rows=200 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 17kB
-> Seq Scan on actor a (cost=0.00..4.00 rows=200 width=10) (actual time=0.016..0.114 rows=200 loops=1)
Planning time: 1.648 ms
Execution time: 36.599 ms
回答1:
This is too much for a comment.
The rewrite of your Correlated Subquery should be like this:
select film_id, title, a.actors
from film
left join
(
select film_actor.film_id, array_agg(first_name) as actors
from actor
inner join film_actor using(actor_id)
group by film_actor.film_id
) as a
on a.film_id = film.film_id
order by title;
Regarding performance, Scalar Correlated Subqueries simply seem to be hard for optimizers, I wouldn't expect them to perform the same or better than a manual rewrite.
回答2:
I'm a little surprised that the second one performs better. For instance, the second one should get a syntax error, because the order by
is before the group by
-- but I get your point.
But the answer to your question is that although SQL is a descriptive language rather than a procedural language, the structure of the query can -- for some databases -- affect the execution plan. If you have looked at the explains, then this is clearly the case for these two queries.
The more important answer is that although the queries look the same, they are not semantically equal. In particular, if film.film_id
is not unique, they return different answers.
来源:https://stackoverflow.com/questions/50434001/why-is-the-query-planner-unable-to-transform-a-correlated-subquery