问题
There are lots of questions on this topic, but all of them seem to be more complex cases than what I'm looking at at the moment and the answers don't seem applicable.
OHDSI=> \d record_counts
Table "results2.record_counts"
Column | Type | Modifiers
------------------------+-----------------------+-----------
concept_id | integer |
schema | text |
table_name | text |
column_name | text |
column_type | text |
descendant_concept_ids | bigint |
rc | numeric |
drc | numeric |
domain_id | character varying(20) |
vocabulary_id | character varying(20) |
concept_class_id | character varying(20) |
standard_concept | character varying(1) |
Indexes:
"rc_dom" btree (domain_id, concept_id)
"rcdom" btree (domain_id)
"rcdomvocsc" btree (domain_id, vocabulary_id, standard_concept)
The table has 3,133,778 records, so Postgres shouldn't be ignoring the index because of small table size.
I filter on domain_id, which is indexed, and the index is ignored:
OHDSI=> explain select * from record_counts where domain_id = 'Drug';
QUERY PLAN
------------------------------------------------------------------------
Seq Scan on record_counts (cost=0.00..76744.81 rows=2079187 width=87)
Filter: ((domain_id)::text = 'Drug'::text)
I turn off seqscan and:
OHDSI=> set enable_seqscan=false;
SET
OHDSI=> explain select * from record_counts where domain_id = 'Drug';
QUERY PLAN
-------------------------------------------------------------------------------------
Bitmap Heap Scan on record_counts (cost=42042.13..105605.97 rows=2079187 width=87)
Recheck Cond: ((domain_id)::text = 'Drug'::text)
-> Bitmap Index Scan on rcdom (cost=0.00..41522.33 rows=2079187 width=0)
Index Cond: ((domain_id)::text = 'Drug'::text)
Indeed, the plan says it's going to be more expensive to use the index than not, but why? If the index lets it handle many fewer records, shouldn't it be quicker to use it?
回答1:
Ok, it looks like Postgres knew what it was doing. The particular value of the indexed column I was using ('Drug') happened to account for 66% of the rows in the table. So, yes, the filter makes the row set significantly smaller, but since those rows would be scattered between pages, the index doesn't allow them to be retrieved faster.
OHDSI=> select domain_id, count(*) as rows, round((100 * count(*)::float / 3133778.0)::numeric,4) pct from record_counts group by 1 order by 2 desc;
domain_id | rows | pct
---------------------+---------+---------
Drug | 2074991 | 66.2137
Condition | 466882 | 14.8984
Observation | 217807 | 6.9503
Procedure | 165800 | 5.2907
Measurement | 127239 | 4.0602
Device | 29410 | 0.9385
Spec Anatomic Site | 28783 | 0.9185
Meas Value | 10415 | 0.3323
Unit | 2350 | 0.0750
Type Concept | 2170 | 0.0692
Provider Specialty | 1957 | 0.0624
Specimen | 1767 | 0.0564
Metadata | 1689 | 0.0539
Revenue Code | 538 | 0.0172
Place of Service | 480 | 0.0153
Race | 467 | 0.0149
Relationship | 242 | 0.0077
Condition/Obs | 182 | 0.0058
Currency | 180 | 0.0057
Condition/Meas | 115 | 0.0037
Route | 81 | 0.0026
Obs/Procedure | 78 | 0.0025
Condition/Device | 52 | 0.0017
Condition/Procedure | 25 | 0.0008
Meas/Procedure | 25 | 0.0008
Gender | 19 | 0.0006
Device/Procedure | 9 | 0.0003
Meas Value Operator | 9 | 0.0003
Visit | 8 | 0.0003
Drug/Procedure | 3 | 0.0001
Spec Disease Status | 3 | 0.0001
Ethnicity | 2 | 0.0001
When I use any other value in the where clause (including 'Condition', with 15% of the rows), Postgres does use the index.
(Somewhat surprisingly, even after I cluster the table based on the domain_id index, it still doesn't use the index when I filter on 'Drug', but the performance improvement for filtering out 34% of the rows doesn't seem worth pursuing this further.)
来源:https://stackoverflow.com/questions/41806803/postgres-not-using-index