Question
This query ran in reasonable time when the tables were small. I'm trying to identify the bottleneck, but I'm not sure how to analyze the EXPLAIN results.
SELECT COUNT(*)
FROM performance_analyses
INNER JOIN total_sales ON total_sales.id = performance_analyses.total_sales_id
WHERE (size > 0)
AND total_sales.customer_id IN (
    SELECT customers.id FROM customers
    WHERE customers.active = 't'
    AND customers.visible = 't'
    AND customers.organization_id = 3
)
AND total_sales.product_category_id IN (
    SELECT product_categories.id FROM product_categories
    WHERE product_categories.organization_id = 3
)
AND total_sales.period_id = 193;
I've tried both approaches: INNER JOINing the customers and product_categories tables, and using inner SELECTs as above. Both took the same time.
Here's the link to EXPLAIN: https://explain.depesz.com/s/9lhr
Postgres version:
PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit
Tables and indexes:
CREATE TABLE total_sales (
id serial NOT NULL,
value double precision,
start_date date,
end_date date,
product_category_customer_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
processed boolean,
customer_id integer,
product_category_id integer,
period_id integer,
CONSTRAINT total_sales_pkey PRIMARY KEY (id)
);
CREATE INDEX index_total_sales_on_customer_id ON total_sales (customer_id);
CREATE INDEX index_total_sales_on_period_id ON total_sales (period_id);
CREATE INDEX index_total_sales_on_product_category_customer_id ON total_sales (product_category_customer_id);
CREATE INDEX index_total_sales_on_product_category_id ON total_sales (product_category_id);
CREATE INDEX total_sales_product_category_period ON total_sales (product_category_id, period_id);
CREATE INDEX ts_pid_pcid_cid ON total_sales (period_id, product_category_id, customer_id);
CREATE TABLE performance_analyses (
id serial NOT NULL,
total_sales_id integer,
status_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
size double precision,
period_size integer,
nominal_variation double precision,
percentual_variation double precision,
relative_performance double precision,
time_ago_max integer,
deseasonalized_series text,
significance character varying,
relevance character varying,
original_variation double precision,
last_level double precision,
quantiles text,
range text,
analysis_method character varying,
CONSTRAINT performance_analyses_pkey PRIMARY KEY (id)
);
CREATE INDEX index_performance_analyses_on_status_id ON performance_analyses (status_id);
CREATE INDEX index_performance_analyses_on_total_sales_id ON performance_analyses (total_sales_id);
CREATE TABLE product_categories (
id serial NOT NULL,
name character varying,
organization_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
external_id character varying,
CONSTRAINT product_categories_pkey PRIMARY KEY (id)
);
CREATE INDEX index_product_categories_on_organization_id ON product_categories (organization_id);
CREATE TABLE customers (
id serial NOT NULL,
name character varying,
external_id character varying,
region_id integer,
organization_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
active boolean DEFAULT false,
visible boolean DEFAULT false,
segment_id integer,
"group" boolean,
group_id integer,
ticket_enabled boolean DEFAULT true,
CONSTRAINT customers_pkey PRIMARY KEY (id)
);
CREATE INDEX index_customers_on_organization_id ON customers (organization_id);
CREATE INDEX index_customers_on_region_id ON customers (region_id);
CREATE INDEX index_customers_on_segment_id ON customers (segment_id);
Rows counts:
- customers - 6,970 rows
- product_categories - 34 rows
- performance_analyses - 1,012,346 rows
- total_sales - 7,104,441 rows
Answer 1:
Your query, rewritten and 100% equivalent:
SELECT count(*)
FROM product_categories pc
JOIN customers c USING (organization_id)
JOIN total_sales ts ON ts.customer_id = c.id
JOIN performance_analyses pa ON pa.total_sales_id = ts.id
WHERE pc.organization_id = 3
AND c.active -- boolean can be used directly
AND c.visible
AND ts.product_category_id = pc.id
AND ts.period_id = 193
AND pa.size > 0;
Another answer advises moving all conditions into join clauses and ordering the tables in the FROM list. That may apply to some other RDBMS with a comparatively primitive query planner. It doesn't hurt in Postgres either, but it also has no effect on performance for your query - assuming default server configuration. The manual:
Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN) is semantically the same as listing the input relations in FROM, so it does not constrain the join order.
Bold emphasis mine. There is more, read the manual.
The key setting is join_collapse_limit (default 8). The Postgres query planner will rearrange your 4 tables in whatever order it expects to be fastest, no matter how you ordered them and whether you write conditions as WHERE or JOIN clauses. No difference whatsoever. (The same is not true for some other types of joins that cannot be rearranged freely.)
The important point is that these different join possibilities give semantically equivalent results but might have hugely different execution costs. Therefore, the planner will explore all of them to try to find the most efficient query plan.
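For reference, the setting can be inspected (and, in rare cases, raised) per session - a minimal sketch, the raised value below is purely illustrative:
SHOW join_collapse_limit;     -- 8 by default
-- Raising it lets the planner consider more join orders, at the cost of planning time.
-- Only relevant for queries joining more tables than the current limit:
SET join_collapse_limit = 12;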
Related:
- Sample Query to show Cardinality estimation error in PostgreSQL
- A: Slow fulltext search due to wildly inaccurate row estimates
Finally, WHERE id IN (<subquery>) is not generally equivalent to a join. It does not multiply rows on the left side for duplicate matching values on the right side, and the columns of the subquery are not visible to the rest of the query. A join can multiply rows with duplicate values, and its columns are visible.
Your simple subqueries dig up a single unique column in both cases, so there is no effective difference here - except that IN (<subquery>) is generally (at least a bit) slower and more verbose. Use joins.
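A contrived sketch of that difference, with made-up toy tables:
-- One row on the left, two matching rows on the right:
CREATE TEMP TABLE l (id int);
CREATE TEMP TABLE r (l_id int);
INSERT INTO l VALUES (1);
INSERT INTO r VALUES (1), (1);

SELECT count(*) FROM l WHERE id IN (SELECT l_id FROM r);  -- returns 1: IN never multiplies rows
SELECT count(*) FROM l JOIN r ON r.l_id = l.id;           -- returns 2: duplicates multiply rows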
Your query
Indexes
product_categories
has 34 rows. Unless you plan on adding many more, indexes do not help performance for this table. A sequential scan will always be faster. Drop index_product_categories_on_organization_id.
customers
has 6,970 rows. Indexes start to make sense at that size. But your query uses 4,988 of them, according to the EXPLAIN output. Only an index-only scan on an index much less wide than the table could help a bit. Assuming WHERE active AND visible are constant predicates, I suggest a partial multicolumn index:
CREATE INDEX customers_org_id_active_visible ON customers (organization_id, id)
WHERE active AND visible;
-- new name chosen here, since index_customers_on_organization_id already exists
I appended id to allow index-only scans. The column is otherwise useless in the index for this query.
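To confirm the planner actually uses it as an index-only scan, something like the following should do (the index name is the one suggested above; the table must have been vacuumed recently enough for the visibility map to be current):
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM customers
WHERE organization_id = 3 AND active AND visible;
-- Expect a line like: Index Only Scan using customers_org_id_active_visible ...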
total_sales
has 7,104,441 rows. Indexes are very important. I suggest:
CREATE INDEX ts_pid_pcid_cid_id
ON total_sales (period_id, product_category_id, customer_id, id);
-- new name chosen here, since index_total_sales_on_product_category_customer_id already exists
Again, aiming for an index-only scan. This is the most important one.
You can delete the completely redundant index index_total_sales_on_product_category_id - it is covered by total_sales_product_category_period, which has the same leading column. The existing ts_pid_pcid_cid is likewise superseded by the wider index suggested above.
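For example:
DROP INDEX index_product_categories_on_organization_id;  -- table has only 34 rows
DROP INDEX index_total_sales_on_product_category_id;     -- covered by total_sales_product_category_period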
performance_analyses
has 1,012,346 rows. Indexes are very important.
I would suggest another partial index with the condition size > 0:
CREATE INDEX performance_analyses_total_sales_id_size
ON performance_analyses (total_sales_id)
WHERE size > 0;
-- no table alias in an index predicate; new name, since index_performance_analyses_on_status_id already exists
However:
Rows Removed by Filter: 0
It seems this condition serves no purpose. Are there any rows for which size > 0 is not true?
After creating these indexes, you need to ANALYZE the tables.
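For example:
ANALYZE customers;             -- refresh planner statistics
ANALYZE total_sales;
ANALYZE performance_analyses;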
Tables statistics
Generally, I see many bad estimates. Postgres underestimates the number of rows returned at almost every step. The nested loops we see would work much better for fewer rows. Unless this is an unlikely coincidence, your table statistics are badly outdated. You need to visit your settings for autovacuum and probably also the per-table settings for your two big tables, performance_analyses and total_sales.
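A sketch of what per-table autovacuum settings can look like; the numbers are illustrative only and must be tuned to your write patterns:
ALTER TABLE total_sales SET (
  autovacuum_vacuum_scale_factor  = 0.02,  -- vacuum after ~2% dead rows (default 20%)
  autovacuum_analyze_scale_factor = 0.01   -- analyze after ~1% changed rows (default 10%)
);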
You already ran VACUUM and ANALYZE, which made the query slower, according to your comment. That doesn't make a lot of sense. I would run VACUUM FULL on these two tables once (if you can afford an exclusive lock). Else try pg_repack.
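For example (both take an exclusive lock on the table for the duration):
VACUUM FULL ANALYZE performance_analyses;
VACUUM FULL ANALYZE total_sales;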
With all the fishy statistics and bad plans, I would consider running a complete vacuumdb -fz yourdb on your DB. That rewrites all tables and indexes in pristine condition, but it's not something to use on a regular basis. It's also expensive and will lock your DB for an extended period of time!
While being at it, have a look at the cost settings of your DB as well (a short sketch follows the links below). Related:
- Keep PostgreSQL from sometimes choosing a bad query plan
- Postgres Slow Queries - Autovacuum frequency
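To illustrate the kind of cost setting meant here (the raised value is an assumption for fast storage, not a recommendation):
SHOW random_page_cost;        -- 4.0 by default, tuned for rotating disks
SET random_page_cost = 1.1;   -- illustrative session value for SSD-class random I/O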
Answer 2:
Although theoretically the optimizer should be able to do this, I often find that these changes can massively improve performance:
- use proper joins (instead of where id in (select ...))
- order the table references in the from clause such that the fewest rows are returned at each join; in particular, the first table's condition (in the where clause) should be the most restrictive (and should use indexes)
- move all conditions on joined tables into the on condition of their joins
Try this (aliases added for readability):
select count(*)
from total_sales ts
join product_categories pc on ts.product_category_id = pc.id and pc.organization_id = 3
join customers c on ts.customer_id = c.id and c.organization_id = 3
join performance_analyses pa on ts.id = pa.total_sales_id and pa.size > 0
where ts.period_id = 193;
You will need to create this index for optimal performance (to allow an index-only scan on total_sales):
create index ts_pid_pcid_cid on total_sales(period_id, product_category_id, customer_id);
(This is the same ts_pid_pcid_cid index already listed in the schema above.)
This approach first narrows the data to a period, so it will scale (remain roughly constant) into the future, because the number of sales per period will be roughly constant.
Answer 3:
The estimates there are not accurate. Postgres's planner wrongly chooses a nested loop - try penalizing nested loops with the statement SET enable_nestloop = off.
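A minimal sketch for testing that in a single session (not a setting to leave off globally):
SET enable_nestloop = off;  -- planner strongly avoids nested-loop joins
-- re-run the query / EXPLAIN ANALYZE here and compare plans
RESET enable_nestloop;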
Source: https://stackoverflow.com/questions/38235142/count-on-join-of-big-tables-with-conditions-is-slow