Indexing jsonb for numeric comparison of fields

Question


I've defined a simple table with

create table resources (id serial primary key, fields jsonb);

And it contains data with keys (drawn from a large set) and values between 1 and 100, like:

   id   |    fields
--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------
      1 | {"tex": 23, "blair": 46, "cubic": 50, "raider": 57, "retard": 53, "hoariest": 78, "suturing": 25, "apostolic": 22, "unloosing": 37, "flagellated": 85}
      2 | {"egoist": 75, "poshest": 0, "annually": 19, "baptists": 29, "bicepses": 10, "eugenics": 9, "idolizes": 8, "spengler": 60, "scuppering": 13, "cliffhangers": 37}
      3 | {"entails": 27, "hideout": 22, "horsing": 98, "abortions": 88, "microsoft": 37, "spectrums": 26, "dilettante": 52, "ringmaster": 84, "floweriness": 72, "vivekananda": 24}
      4 | {"wraps": 6, "polled": 68, "coccyges": 63, "internes": 93, "unburden": 61, "aggregate": 76, "cavernous": 98, "stylizing": 65, "vamoosing": 35, "unoriginal": 40}
      5 | {"villon": 95, "monthly": 68, "puccini": 30, "samsung": 81, "branched": 33, "congeals": 6, "shriller": 47, "terracing": 27, "patriarchal": 86, "compassionately": 94}

I'd like to search for entries whose value (associated with a particular key) is greater than some benchmark value. I can accomplish this, for example via:

with exploded as (
    select id, (jsonb_each_text(fields)).*
    from resources)
select distinct id
    from exploded
    where key='polled' and value::integer>50;

... but of course this does not use an index, and it resorts to a table scan. I wonder if there is:

  1. A more efficient way to query for resources with "polled" > 50
  2. A way to build indexes that will support this kind of query

Answer 1:


You haven't specified what kind of index you expect to be used, and you haven't provided its definition.

The typical index for a jsonb field would be a GIN one, but in your specific case you actually need to compare the values stored under the polled key.

Maybe a specific index (though not a GIN one!) built on an expression could be of some use, but I doubt it, and it could get quite cumbersome: you would need at least a double type cast to obtain an integer value, and a custom IMMUTABLE function to actually perform the type casts in your CREATE INDEX statement.
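Just to sketch the idea, an expression index along these lines might look like the following (the index name is made up, and the bare ::integer cast is only safe if every polled value really is numeric text; otherwise the index build will fail, which is where a wrapper function would come in):

-- Hypothetical expression index on the extracted "polled" value, cast to integer.
-- Rows without the key yield NULL, which is fine; non-numeric values are not.
CREATE INDEX resources_polled_int_idx
    ON resources (((fields ->> 'polled')::integer));

-- A query must repeat the exact same expression to be able to use it:
SELECT id FROM resources WHERE (fields ->> 'polled')::integer > 50;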

Before taking a complicated route that would only solve some specific cases (what if you need another comparison on a different fields key?), you could try to optimize your current query by taking advantage of LATERAL joins and the jsonb processing functions introduced in PostgreSQL 9.4. The result is a query that should run up to 8 times faster than your current one:

SELECT r.id 
    FROM resources AS r,
    LATERAL jsonb_to_record(r.fields) AS l(polled integer) 
    WHERE l.polled > 50;



EDIT:

I did a quick test to put into practice the idea from my comment of using a GIN index to restrict the number of rows before actually comparing the values, and it turns out you really can make some use of a GIN index even in this situation.

The index must be created with the default operator class jsonb_ops (not the lighter, better-performing jsonb_path_ops, which does not support the ? operator):

CREATE INDEX ON resources USING GIN (fields);

Now you can take advantage of the index simply by including an existence test (the ? operator) in the query:

SELECT r.id
    FROM resources AS r,
    LATERAL jsonb_to_record(r.fields) AS l(polled integer) 
    WHERE r.fields ? 'polled' AND l.polled > 50;

The query now performs about 3 times faster than the plain LATERAL version (which makes it about 20 times faster than the original CTE version). I've tested with up to 1M rows and the performance gain stays roughly the same.


Keep in mind that, as expected, the number of rows plays an important role: with fewer than about 1K rows the index is of little use and the query planner will probably not use it.
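If you want to check whether the planner actually picks the index at your row count, look at the plan; resources_fields_idx below is the name PostgreSQL generates by default for the CREATE INDEX above, so adjust it if yours differs:

-- Look for a "Bitmap Index Scan" on resources_fields_idx in the output;
-- a plain "Seq Scan on resources" means the planner skipped the index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT r.id
    FROM resources AS r,
    LATERAL jsonb_to_record(r.fields) AS l(polled integer)
    WHERE r.fields ? 'polled' AND l.polled > 50;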

Also, don't forget that a jsonb_ops index can become huge compared to the actual data. With a data sample like yours, ranging from 1K to 1M rows, the index itself is about 170% bigger than the actual data in the table; check it yourself:

SELECT pg_size_pretty(pg_total_relation_size('resources')) AS whole_table, 
       pg_size_pretty(pg_relation_size('resources')) AS data_only, 
       pg_size_pretty(pg_relation_size('resources_fields_idx')) AS gin_index_only;

Just to give you an idea: with about 300K rows like your data sample, the table is about 250MB in total, consisting of 90MB of data and 160MB of index! Personally, I would stick (and I actually do) with a simple LATERAL join and no index.



Source: https://stackoverflow.com/questions/30089350/indexing-jsonb-for-numeric-comparison-of-fields
