I have a star schema here. I am querying the fact table, and I would like to join one very small dimension table. I can't really explain the following:
EXPLAIN ANALYZE SELECT
COUNT(impression_id), imp.os_id
FROM bi.impressions imp
GROUP BY imp.os_id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=868719.08..868719.24 rows=16 width=10) (actual time=12559.462..12559.466 rows=26 loops=1)
  ->  Seq Scan on impressions imp  (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.009..3030.093 rows=35682474 loops=1)
Total runtime: 12559.523 ms
(3 rows)
This takes ~12,600 ms, but of course there is no joined data, so I can't "resolve" imp.os_id to something meaningful. So I add a join:
EXPLAIN ANALYZE SELECT
COUNT(impression_id), imp.os_id, os.os_desc
FROM bi.impressions imp, bi.os_desc os
WHERE imp.os_id=os.os_id
GROUP BY imp.os_id, os.os_desc;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=1448560.83..1448564.99 rows=416 width=22) (actual time=25565.124..25565.127 rows=26 loops=1)
  ->  Hash Join  (cost=1.58..1180942.29 rows=35682472 width=22) (actual time=0.046..15157.684 rows=35682474 loops=1)
        Hash Cond: (imp.os_id = os.os_id)
        ->  Seq Scan on impressions imp  (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.007..3705.647 rows=35682474 loops=1)
        ->  Hash  (cost=1.26..1.26 rows=26 width=14) (actual time=0.028..0.028 rows=26 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 2kB
              ->  Seq Scan on os_desc os  (cost=0.00..1.26 rows=26 width=14) (actual time=0.003..0.010 rows=26 loops=1)
Total runtime: 25565.199 ms
(8 rows)
This effectively doubles the execution time of my query. My question is: what am I leaving out of the picture? I wouldn't have thought such a small lookup would cause a huge difference in query execution time.
Rewritten with (recommended) explicit ANSI JOIN syntax:
SELECT COUNT(impression_id), imp.os_id, os.os_desc
FROM bi.impressions imp
JOIN bi.os_desc os ON os.os_id = imp.os_id
GROUP BY imp.os_id, os.os_desc;
First of all, your second query might be wrong: if more or fewer than exactly one match is found in os_desc for every row in impressions, the counts change. This can be ruled out if you have a foreign key constraint on os_id in place that guarantees referential integrity, plus a NOT NULL constraint on bi.impressions.os_id (a sketch of the DDL follows below).
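If those constraints are not in place yet, something along these lines would establish them. A minimal sketch, assuming bi.os_desc.os_id is the primary key (or at least UNIQUE, which the foreign key requires); the constraint name is hypothetical:

ALTER TABLE bi.impressions
    ALTER COLUMN os_id SET NOT NULL;

-- requires bi.os_desc.os_id to be PRIMARY KEY or UNIQUE;
-- validation scans the big table once
ALTER TABLE bi.impressions
    ADD CONSTRAINT impressions_os_id_fkey
    FOREIGN KEY (os_id) REFERENCES bi.os_desc (os_id);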
If so, in a first step, simplify to:
SELECT COUNT(*) AS ct, imp.os_id, os.os_desc
FROM bi.impressions imp
JOIN bi.os_desc os USING (os_id)
GROUP BY imp.os_id, os.os_desc;
count(*) is faster than count(column) and equivalent here if the column is NOT NULL. And add a column alias for the count.
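The difference is easy to demonstrate: count(*) counts rows, while count(column) skips NULL values. A self-contained example, runnable as-is in Postgres:

-- count(*) counts all three rows; count(x) skips the NULL
SELECT count(*) AS total_rows, count(x) AS non_null
FROM  (VALUES (1), (NULL), (3)) AS t(x);
-- returns total_rows = 3, non_null = 2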
Faster yet:
SELECT os_id, os.os_desc, sub.ct
FROM  (
    SELECT os_id, count(*) AS ct
    FROM   bi.impressions
    GROUP  BY 1
    ) sub
JOIN  bi.os_desc os USING (os_id);
Group first, join later. Aggregating 35 million rows down to 26 groups first means the join only has to process 26 rows instead of 35 million.
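As a side note: if os_desc can contain operating systems with no impressions at all and you want those listed with a zero count, a LEFT JOIN variant of the same group-first idea would do it. A sketch under that assumption:

SELECT os.os_id, os.os_desc, COALESCE(sub.ct, 0) AS ct
FROM   bi.os_desc os
LEFT   JOIN (
    SELECT os_id, count(*) AS ct
    FROM   bi.impressions
    GROUP  BY 1
    ) sub USING (os_id);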
Compare the aggregate nodes of the two plans:

HashAggregate  (cost=868719.08..868719.24 rows=16 width=10)
HashAggregate  (cost=1448560.83..1448564.99 rows=416 width=22)

Hmm, the width goes from 10 to 22, a doubling. Perhaps you should join after grouping instead of before?
The following query solves the problem without increasing the query execution time. The question still stands: why does the execution time increase significantly when a very simple join is added? But that may be a Postgres-specific question, and somebody with extensive experience in the area may answer it eventually.
WITH oses AS (
    SELECT os_id, os_desc FROM bi.os_desc
)
SELECT count(impression_id) AS imp_count,
       os_desc
FROM   bi.impressions imp,
       oses os
WHERE  os.os_id = imp.os_id
GROUP  BY os_desc
ORDER  BY imp_count;
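One plausible factor, worth verifying on your own version: before PostgreSQL 12, a CTE was always materialized and acted as an optimization fence, which can change the plan compared to a plain join. From PostgreSQL 12 on, that behavior has to be requested explicitly. A sketch for 12+ only (the MATERIALIZED keyword is not accepted on older versions):

-- PostgreSQL 12+ only: force the CTE to be materialized, mimicking
-- the pre-12 optimization-fence behavior this query may rely on
WITH oses AS MATERIALIZED (
    SELECT os_id, os_desc FROM bi.os_desc
)
SELECT count(impression_id) AS imp_count,
       os_desc
FROM   bi.impressions imp
JOIN   oses os USING (os_id)
GROUP  BY os_desc
ORDER  BY imp_count;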
Source: https://stackoverflow.com/questions/18970357/why-does-the-following-join-increase-the-query-time-significantly