Query performance in PostgreSQL using 'similar to'

伪装坚强ぢ 2021-01-20 14:00

I need to retrieve certain rows from a table depending on certain values in a specific column, named columnX in the example:

select *
from t         


        
4 Answers
  • 2021-01-20 14:13

    I agree with @Quassnoi: a GIN index is the fastest and simplest solution, unless write performance or disk space is a concern. A GIN index occupies a lot of space and noticeably slows down INSERT, UPDATE and DELETE.

    My additional answer is triggered by your statement:

    I can't find a better approach than using similar to.
    

    If that is what you found, then your search isn't over yet. SIMILAR TO is a complete waste of time. Literally. PostgreSQL only features it to comply with the (weird) SQL standard. Inspect the output of EXPLAIN ANALYZE for your query and you will find that SIMILAR TO has been replaced by a regular expression.

    Internally, every SIMILAR TO expression is rewritten to a regular expression. Consequently, for each and every SIMILAR TO expression there is at least one regular expression match that is a bit faster. Let EXPLAIN ANALYZE translate it for you if you are not sure. You won't find this in the manual, and PostgreSQL does not promise to do it this way, but I have yet to see an exception.
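
    As a rough illustration, assume the search is for rows whose columnX contains any of a few single characters. The original query is cut off, so the pattern and the table name tableName below are guesses, not the asker's actual query. Under that assumption the same filter can be written both ways and compared with EXPLAIN:

    ```sql
    -- Hypothetical pattern: rows whose columnX contains A, B or C.
    -- The Filter line of the plan should show a regular expression match (~)
    -- in place of SIMILAR TO.
    EXPLAIN
    SELECT *
    FROM   tableName
    WHERE  columnX SIMILAR TO '%(A|B|C)%';

    -- The same condition written directly as a POSIX regular expression match.
    -- SIMILAR TO must match the whole string, hence the leading and trailing %
    -- above; the ~ operator matches anywhere in the string, so no wildcards
    -- are needed here.
    EXPLAIN
    SELECT *
    FROM   tableName
    WHERE  columnX ~ '[ABC]';
    ```

    Neither form can use a plain b-tree index for this kind of unanchored match, so the difference between them is likely small compared to what an index-based approach can deliver.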

    More details in this related answer on dba.SE.

  • 2021-01-20 14:17

    This strikes me as a data modelling issue. You appear to be using a text field as a set, storing single character codes to identify values present in the set.

    If so, I'd want to remodel this table to use one of the following approaches:

    • Standard relational normalization. Drop columnX, and replace it with a new table with a foreign key reference to tableName(id) and a charcode column that contains one character from the old columnX per row, like CREATE TABLE tablename_columnx_set(tablename_id integer not null references tablename(id), charcode "char", primary key (tablename_id, charcode)). You can then fairly efficiently search for keys in columnX using normal SQL subqueries, joins, etc. If your application can't cope with that change you could always keep columnX and maintain the side table using triggers.

    • Convert columnX to an hstore of keys with a dummy value. You can then use hstore operators like columnX ?| ARRAY['A','B','C']. A GiST index on the hstore of columnX should provide fairly solid performance for those operations.

    • Split columnX into an array, as recommended by Quassnoi, if your table change rate is low and you can pay the costs of the GIN index.

    • Convert columnX to an array of integers, use intarray and the intarray GiST index. Have a mapping table of codes to integers or convert in the application.
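
    The first option above (relational normalization) could be sketched roughly as follows. This assumes that tableName has an integer primary key named id and that columnX holds only single-byte character codes; the population query and the final lookup are illustrative and were not part of the original answer.

    ```sql
    -- Side table as described above: one row per (row id, character code).
    CREATE TABLE tablename_columnx_set (
        tablename_id integer NOT NULL REFERENCES tableName (id),
        charcode     "char"  NOT NULL,
        PRIMARY KEY (tablename_id, charcode)
    );

    -- Populate it from the existing column. DISTINCT avoids primary key
    -- violations when a character occurs more than once in columnX.
    INSERT INTO tablename_columnx_set (tablename_id, charcode)
    SELECT DISTINCT t.id, ch::"char"
    FROM   tableName AS t,
           regexp_split_to_table(t.columnX, '') AS ch
    WHERE  ch <> '';

    -- Rows of tableName whose set contains at least one of the given codes.
    SELECT t.*
    FROM   tableName AS t
    WHERE  EXISTS (
        SELECT 1
        FROM   tablename_columnx_set AS s
        WHERE  s.tablename_id = t.id
        AND    s.charcode IN ('A', 'B', 'C')
    );
    ```

    If lookups by code dominate, an additional index on (charcode, tablename_id) may help, since the primary key leads with tablename_id.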

    Time permitting I'll follow up with demos of each. Making up the dummy data is a pain, so it'll depend on what else is going on.
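
    For the hstore option, a minimal sketch on a stand-alone demo table might look like this. The table hstore_demo, its columns and the dummy value 1 are made up for illustration, and the hstore extension has to be available on the server.

    ```sql
    CREATE EXTENSION IF NOT EXISTS hstore;

    -- Hypothetical demo table: each key is one character code, the value is a dummy.
    CREATE TABLE hstore_demo (
        id           integer PRIMARY KEY,
        columnx_keys hstore  NOT NULL
    );

    INSERT INTO hstore_demo (id, columnx_keys) VALUES
        (1, 'A=>1, B=>1'),
        (2, '3=>1'),
        (3, 'Z=>1');

    -- GiST index on the hstore column, as suggested in the list above.
    CREATE INDEX hstore_demo_keys_idx ON hstore_demo USING gist (columnx_keys);

    -- ?| is true when the hstore contains any of the listed keys.
    -- With the rows above, only id 1 qualifies.
    SELECT id
    FROM   hstore_demo
    WHERE  columnx_keys ?| ARRAY['A', 'B', 'C'];
    ```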

  • 2021-01-20 14:25

    If you are only going to search lists of one-character values, then split each string into an array of characters and index the array:

    CREATE INDEX
            ix_tablename_columnxlist
    ON      tableName
    USING   GIN((REGEXP_SPLIT_TO_ARRAY(columnX, '')));
    

    then search against the index:

    SELECT  *
    FROM    tableName
    WHERE   REGEXP_SPLIT_TO_ARRAY(columnX, '') && ARRAY['A', 'B', 'C', '1', '2', '3'];
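
    To check that this behaves as intended, something along these lines may help. The sample string 'AB1' is made up, and on a very small table the planner may still prefer a sequential scan even though the index is usable.

    ```sql
    -- Splitting on an empty pattern yields one array element per character.
    SELECT regexp_split_to_array('AB1', '');

    -- The WHERE expression must be the same expression the index was built on
    -- for the planner to consider the GIN index.
    EXPLAIN
    SELECT *
    FROM   tableName
    WHERE  regexp_split_to_array(columnX, '') && ARRAY['A', 'B', 'C'];
    ```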
    
  • 2021-01-20 14:27

    I'll post this as an answer because it may guide other people in the future: why not have six columns (haveA, haveB, and so on through have3) and do a six-part OR query? Or use a bitmask?

    If there are too many attributes to assign a column each, I might try creating an "attribute" table:

    INSERT INTO attribute (fkey, attr) VALUES (1, 'A'), (1, 'B'), (2, '3');
    

    and let the DBMS worry about the optimization.
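
    The bitmask idea mentioned above could look roughly like this. The table bitmask_demo, the flags column and the bit assignment are all assumptions chosen for the example.

    ```sql
    -- Assumed bit assignment: A = 1, B = 2, C = 4, '1' = 8, '2' = 16, '3' = 32.
    CREATE TABLE bitmask_demo (
        id    integer PRIMARY KEY,
        flags integer NOT NULL DEFAULT 0
    );

    INSERT INTO bitmask_demo (id, flags) VALUES
        (1, 1 | 2),   -- has A and B
        (2, 32),      -- has '3'
        (3, 0);       -- has nothing

    -- Rows that have A or C: any overlap between flags and the mask 1 | 4.
    -- With the rows above, only id 1 qualifies.
    SELECT id
    FROM   bitmask_demo
    WHERE  (flags & (1 | 4)) <> 0;
    ```

    A plain b-tree index on flags does not help with this kind of predicate, so the bitmask mainly saves space rather than lookup time on large tables.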
