Postgres multi-column index (integer, boolean, and array)

问题

I have a Postgres 9.4 database with a table like this:

| id | other_id | current | dn_ids                                | rank |
|----|----------|---------|---------------------------------------|------|
| 1  | 5        | F       | {123,234,345,456,111,222,333,444,555} | 1    |
| 2  | 7        | F       | {123,100,200,900,800,700,600,400,323} | 2    |

(update) I already have a couple indexes defined. Here is the CREATE TABLE syntax:

CREATE TABLE mytable (
    id integer NOT NULL,
    other_id integer,
    rank integer,
    current boolean DEFAULT false,
    dn_ids integer[] DEFAULT '{}'::integer[]
);

CREATE SEQUENCE mytable_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1;

ALTER TABLE ONLY mytable ALTER COLUMN id SET DEFAULT nextval('mytable_id_seq'::regclass);
ALTER TABLE ONLY mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);

CREATE INDEX ind_dn_ids ON mytable USING gin (dn_ids);
CREATE INDEX index_mytable_on_current ON mytable USING btree (current);
CREATE INDEX index_mytable_on_other_id ON mytable USING btree (other_id);
CREATE INDEX index_mytable_on_other_id_and_current ON mytable USING btree (other_id, current);

I need to optimize queries like this:

SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = F AND NOT (ARRAY[100,200] && dn_ids)
ORDER BY rank ASC
LIMIT 500 OFFSET 1000

This query works fine, but I'm sure it could be much faster with smart indexing. There are about 250,000 rows in the table and I always have current = F as a predicate. The input array I'm comparing to the stored array will have 1-9 integers, as well. The other_id can vary. But generally, before limiting, the scan will match between 0-25,000 rows.

Here's an example EXPLAIN:

Limit  (cost=36944.53..36945.78 rows=500 width=65)
  ->  Sort  (cost=36942.03..37007.42 rows=26156 width=65)
        Sort Key: rank
        ->  Seq Scan on mytable  (cost=0.00..35431.42 rows=26156 width=65)
              Filter: ((NOT current) AND (NOT ('{-1,35257,35314}'::integer[] && dn_ids)) AND (other_id = 193))

Other answers on this site and the Postgres docs suggest it's possible to add a compound index to improve performance. I already have one on [other_id, current]. I've also read in various places that indexing can improve the performance of the ORDER BY in addition to the WHERE clause.

What's the right type of compound index to use for this query? I don't care about space at all.
Does it matter much how I order the terms in the WHERE clause?

回答1:

What's the right type of compound index to use for this query? I don't care about space at all.

This depends on the complete situation. Either way, the GIN index you already have is most probably superior to a GiST index in your case:

Difference between GiST and GIN index

You can combine either with integer columns once you install the additional module btree_gin (or btree_gist, respectively).

Multicolumn index on 3 fields with heterogenous data types

However, that does not cover the boolean data type, which typically doesn't make sense as index column to begin with. With just two (three incl. NULL) possible values it's not selective enough.

And a plain btree index is more efficient for integer. While a multicolumn btree index on two integer columns would certainly help, you'll have to test carefully if combining (other_id, dn_ids) in a multicolumn GIN index is worth more than it costs. Probably not. Postgres can combine multiple indexes in a bitmap index scan rather efficiently.

~~Finally, while indexes can be used for sorted output, this will probably not pay to apply for a query like you display (unless you select large parts of the table).~~
Not applicable to updated question.

Partial indexes might be an option. Other than that, you already have all the indexes you need.

I would drop the pointless index on the boolean column current completely, and the index on just rank is probably never used for this query.

Does it matter much how I order the terms in the WHERE clause?

The order of WHERE conditions is completely irrelevant.

Addendum after question update

The utility of indexes is bound to selective criteria. If more than roughly 5 % (depends on various factors) of the table are selected, a sequential scan of the whole table is typically faster than dealing with the overhead on any indexes - except for pre-sorting output, that's the one thing an index is still good for in such cases.

For a query that fetches 25,000 of 250,000 rows, indexes are mostly just for that - which gets all the more interesting if you attach a LIMIT clause. Postgres can stop fetching rows from an index once the LIMIT is satisfied.

Be aware that Postgres always needs to read OFFSET + LIMIT rows, so performance deteriorate with the sum of both.

Even with your added information, much of what's relevant is still in the dark. I am going to assume that:

Your predicate NOT (ARRAY[100,200] && dn_ids) is not very selective. Ruling out 1 to 10 ID values should typically retain the majority of rows unless you have very few distinct elements in dn_ids.
The most selective predicate is other_id = 5.
A substantial part of the rows is eliminated with NOT current.
_{Aside: current = F isn't valid syntax in standard Postgres. Must be NOT current or current = FALSE;}

While a GIN index would be great to identify few rows with matching arrays faster than any other index type, this seems hardly relevant for your query. My best guess is this partial, multicolumn btree index:

CREATE INDEX foo ON mytable (other_id, rank, dn_ids)
WHERE NOT current;

The array column dn_ids in a btree index cannot support the && operator, I just include it to allow index-only scans and filter rows before accessing the heap (the table). May even be faster without dn_ids in the index:

CREATE INDEX foo ON mytable (other_id, rank) WHERE NOT current;

GiST indexes may become more interesting in Postgres 9.5 due to this new feature:

Allow GiST indexes to perform index-only scans (Anastasia Lubennikova, Heikki Linnakangas, Andreas Karlsson)

Aside: current is a reserved word in standard SQL, even if it's allowed as identifier in Postgres.
Aside 2: I assume id is an actual serial column with the column default set. Just creating a sequence like you demonstrate, would do nothing.

Auto increment SQL function

回答2:

Unfortunately I don't think you can combine a BTree and a GIN/GIST index into a single compound index, so the planner is going to have to choose between using the other_id index or the dn_ids index. One advantage of using other_id, as you pointed out, is that you could use a multicolumn index to improve the sort performance. The way you would do this would be

 CREATE INDEX index_mytable_on_other_id_and_current
           ON mytable (other_id, rank) WHERE current = F;

This is using a partial index, and will allow you to skip the sort step when you are sorting by rank and querying on other_id.

Depending on the cardinality of other_id, the only benefit of this might be the sorting. Because your plan has a LIMIT clause, it's difficult to tell. SEQ scans can be the fastest option if you're using > 1/5 of the table, especially if you're using a standard HDD instead of solid state. If you're planner insists on SEQ scanning when you know an IDX scan is faster (you've tested with enable_seqscan false, you may want to try fine tuning your random_page_cost or effective_cache_size.

Finally, I'd recomment not keeping all of these indexes. Find the ones you need, and cull the rest. Indexes cause huge performance degradation in inserts (especially mutli-column and GIN/GIST indexes).