Count on join of big tables with conditions is slow

Posted by 孤街浪徒 on 2021-02-10 06:11:54

Question


This query ran in reasonable time when the tables were small. I'm trying to identify the bottleneck, but I'm not sure how to analyze the EXPLAIN results.

SELECT
  COUNT(*)
FROM performance_analyses
INNER JOIN total_sales ON total_sales.id = performance_analyses.total_sales_id
WHERE
  (size > 0) AND
  total_sales.customer_id IN (
    SELECT customers.id FROM customers WHERE customers.active = 't'
    AND customers.visible = 't' AND customers.organization_id = 3
  ) AND
  total_sales.product_category_id IN (
    SELECT product_categories.id FROM product_categories
    WHERE product_categories.organization_id = 3
  ) AND
  total_sales.period_id = 193;

I've tried both approaches: INNER JOINing the customers and product_categories tables, and using the inner SELECTs shown above. Both took the same time.

Here's the link to EXPLAIN: https://explain.depesz.com/s/9lhr

Postgres version:

PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit

Tables and indexes:

CREATE TABLE total_sales (
  id serial NOT NULL,
  value double precision,
  start_date date,
  end_date date,
  product_category_customer_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  processed boolean,
  customer_id integer,
  product_category_id integer,
  period_id integer,
  CONSTRAINT total_sales_pkey PRIMARY KEY (id)
);
CREATE INDEX index_total_sales_on_customer_id ON total_sales (customer_id);
CREATE INDEX index_total_sales_on_period_id ON total_sales (period_id);
CREATE INDEX index_total_sales_on_product_category_customer_id ON total_sales (product_category_customer_id);
CREATE INDEX index_total_sales_on_product_category_id ON total_sales (product_category_id);
CREATE INDEX total_sales_product_category_period ON total_sales (product_category_id, period_id);
CREATE INDEX ts_pid_pcid_cid ON total_sales (period_id, product_category_id, customer_id);


CREATE TABLE performance_analyses (
  id serial NOT NULL,
  total_sales_id integer,
  status_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  size double precision,
  period_size integer,
  nominal_variation double precision,
  percentual_variation double precision,
  relative_performance double precision,
  time_ago_max integer,
  deseasonalized_series text,
  significance character varying,
  relevance character varying,
  original_variation double precision,
  last_level double precision,
  quantiles text,
  range text,
  analysis_method character varying,
  CONSTRAINT performance_analyses_pkey PRIMARY KEY (id)
);
CREATE INDEX index_performance_analyses_on_status_id ON performance_analyses (status_id);
CREATE INDEX index_performance_analyses_on_total_sales_id ON performance_analyses (total_sales_id);


CREATE TABLE product_categories (
  id serial NOT NULL,
  name character varying,
  organization_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  external_id character varying,
  CONSTRAINT product_categories_pkey PRIMARY KEY (id)
);
CREATE INDEX index_product_categories_on_organization_id ON product_categories (organization_id);


CREATE TABLE customers (
  id serial NOT NULL,
  name character varying,
  external_id character varying,
  region_id integer,
  organization_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  active boolean DEFAULT false,
  visible boolean DEFAULT false,
  segment_id integer,
  "group" boolean,
  group_id integer,
  ticket_enabled boolean DEFAULT true,
  CONSTRAINT customers_pkey PRIMARY KEY (id)
);
CREATE INDEX index_customers_on_organization_id ON customers (organization_id);    
CREATE INDEX index_customers_on_region_id ON customers (region_id);
CREATE INDEX index_customers_on_segment_id ON customers (segment_id);

Row counts:

  • customers - 6,970 rows
  • product_categories - 34 rows
  • performance_analyses - 1,012,346 rows
  • total_sales - 7,104,441 rows

Answer 1:


Your query, rewritten and 100% equivalent:

SELECT count(*)
FROM   product_categories   pc 
JOIN   customers            c  USING (organization_id) 
JOIN   total_sales          ts ON ts.customer_id = c.id
JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
WHERE  pc.organization_id = 3
AND    c.active  -- boolean can be used directly
AND    c.visible
AND    ts.product_category_id = pc.id
AND    ts.period_id = 193
AND    pa.size > 0;

Another answer advises moving all conditions into join clauses and ordering the tables in the FROM list. That may help in some other RDBMS with a comparatively primitive query planner. While it doesn't hurt in Postgres either, it also has no effect on performance for your query - assuming default server configuration. The manual:

Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN) is semantically the same as listing the input relations in FROM, so it does not constrain the join order.

Emphasis mine. There is more; read the manual.

The key setting is join_collapse_limit (with default 8). The Postgres query planner will rearrange your 4 tables any way it expects it to be fastest, no matter how you arranged your tables and whether you write conditions as WHERE or JOIN clauses. No difference whatsoever. (The same is not true for some other types of joins that cannot be rearranged freely.)

The important point is that these different join possibilities give semantically equivalent results but might have hugely different execution costs. Therefore, the planner will explore all of them to try to find the most efficient query plan.
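You can verify this in a session. The sketch below is illustrative only - lowering the limit is a diagnostic, not a recommendation:

```sql
SHOW join_collapse_limit;      -- default: 8; this query joins only 4 tables

SET join_collapse_limit = 1;   -- now the written join order becomes binding
-- re-run EXPLAIN on the query here and compare the plans
RESET join_collapse_limit;     -- restore the default
```

With the default of 8, the plan is identical no matter how the 4 tables are arranged in the query text.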

Related:

  • Sample Query to show Cardinality estimation error in PostgreSQL
  • A: Slow fulltext search due to wildly inaccurate row estimates

Finally, WHERE id IN (<subquery>) is not generally equivalent to a join. It does not multiply rows on the left side for duplicate matching values on the right side. And columns of the subquery are not visible for the rest of the query. A join can multiply rows with duplicate values and columns are visible.
Your simple subqueries dig up a single unique column in both cases, so there is no effective difference in this case - except that IN (<subquery>) is generally (at least a bit) slower and more verbose. Use joins.
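The row-multiplication difference can be demonstrated with a minimal example, using VALUES lists to stand in for real tables:

```sql
-- One row on the left, two matching rows on the right:
-- the join multiplies the left row ...
SELECT count(*)                                      -- returns 2
FROM  (VALUES (1)) AS l(id)
JOIN  (VALUES (1), (1)) AS r(id) USING (id);

-- ... while IN only tests membership, so the left row stays single.
SELECT count(*)                                      -- returns 1
FROM  (VALUES (1)) AS l(id)
WHERE l.id IN (SELECT id FROM (VALUES (1), (1)) AS r(id));
```

In your case, the subqueries return unique id values, so both forms produce the same count.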

Your query

Indexes

product_categories has 34 rows. Unless you plan on adding many more, indexes do not help performance for this table. A sequential scan will always be faster. Drop index_product_categories_on_organization_id.

customers has 6,970 rows. Indexes start to make sense at this size. But your query selects 4,988 of them according to the EXPLAIN output, so only an index-only scan on an index much narrower than the table can help a bit. Assuming WHERE active AND visible are constant predicates, I suggest a partial multicolumn index:

CREATE INDEX customers_organization_id_active_visible ON customers (organization_id, id)
WHERE active AND visible;

I appended id to allow index-only scans. The column is otherwise useless in the index for this query.

total_sales has 7,104,441 rows. Indexes are very important. I suggest:

CREATE INDEX total_sales_period_category_customer
ON total_sales (period_id, product_category_id, customer_id, id);

Again, aiming for an index-only scan. This is the most important one.

You can delete the completely redundant index index_total_sales_on_product_category_id.

performance_analyses has 1,012,346 rows. Indexes are very important. I would suggest another partial index with the condition size > 0:

CREATE INDEX performance_analyses_total_sales_id_size
ON performance_analyses (total_sales_id)
WHERE size > 0;

However:

Rows Removed by Filter: 0

It seems this condition serves no purpose. Are there any rows for which size > 0 is not true?

After creating these indexes you need to ANALYZE the tables.
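For instance:

```sql
-- Refresh planner statistics for each table touched by the new indexes.
ANALYZE customers;
ANALYZE total_sales;
ANALYZE performance_analyses;
```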

Tables statistics

Generally, I see many bad estimates. Postgres underestimates the number of rows returned at almost every step. The nested loops we see would work much better for fewer rows. Unless this is an unlikely coincidence, your table statistics are badly outdated. You need to visit your settings for autovacuum and probably also per-table settings for your two big tables performance_analyses and total_sales.
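Per-table autovacuum settings can be set as storage parameters. A sketch - the numbers are illustrative assumptions, not tuned values for your workload:

```sql
-- Make autovacuum/autoanalyze fire more aggressively on the two big tables.
ALTER TABLE total_sales SET (
  autovacuum_vacuum_scale_factor  = 0.02,  -- vacuum after ~2% dead rows
  autovacuum_analyze_scale_factor = 0.01   -- refresh statistics more often
);
ALTER TABLE performance_analyses SET (
  autovacuum_vacuum_scale_factor  = 0.02,
  autovacuum_analyze_scale_factor = 0.01
);
```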

According to your comment, you already ran VACUUM and ANALYZE, and the query got slower afterwards. That doesn't make much sense. I would run VACUUM FULL on these two tables once (if you can afford an exclusive lock). Otherwise, try pg_repack.
Given all the fishy statistics and bad plans, I would consider running a complete vacuumdb -fz yourdb on your database. That rewrites all tables and indexes in pristine condition, but it's not something to use on a regular basis. It's also expensive and will lock your database for an extended period of time!

While being at it, have a look at the cost settings of your DB as well. Related:

  • Keep PostgreSQL from sometimes choosing a bad query plan
  • Postgres Slow Queries - Autovacuum frequency



Answer 2:


Although theoretically the optimizer should be able to do this, I often find that these changes can massively improve performance:

  • use proper joins (instead of where id in (select ...))
  • order the references to tables in the FROM clause so that the fewest rows are returned at each join; in particular, the first table's condition (in the WHERE clause) should be the most restrictive (and should use indexes)
  • move all conditions on joined tables into the on condition of joins

Try this (aliases added for readability):

select count(*)
from total_sales ts
join product_categories pc on ts.product_category_id = pc.id and pc.organization_id = 3
join customers c on ts.customer_id = c.id and c.organization_id = 3 and c.active and c.visible
join performance_analyses pa on ts.id = pa.total_sales_id and pa.size > 0
where ts.period_id = 193;

You will need to create this index for optimal performance (to allow an index-only scan on total_sales):

create index ts_pid_pcid_cid on total_sales (period_id, product_category_id, customer_id);

This approach first narrows the data to a period, so it will scale (remain roughly constant) into the future, because the number of sales per period will be roughly constant.




Answer 3:


The estimates there are not accurate. Postgres's planner is wrongly choosing nested loops - try penalizing them with SET enable_nestloop TO off.
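As a session-level diagnostic (not a permanent setting), that looks like:

```sql
SET enable_nestloop = off;   -- strongly discourages nested-loop joins
-- run EXPLAIN ANALYZE on the slow query here and compare the plan
RESET enable_nestloop;       -- restore the default afterwards
```

If the plan improves dramatically, that confirms the row estimates (not the join strategy itself) are the root problem, and fixing the statistics is the durable solution.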



Source: https://stackoverflow.com/questions/38235142/count-on-join-of-big-tables-with-conditions-is-slow
