ArangoDB Index usage with edge collections

Question

Task: find the fastest way to update attributes on many edges. For performance reasons, I am bypassing the graph methods and working with the collection directly for filtering.

ArangoDB 2.8b3

Query (Offer is an edge collection):

FOR O IN Offer
FILTER O._from == @from AND O._to == @to AND O.expired > DATE_TIMESTAMP(@newoffertime)
UPDATE O WITH { expired: @newoffertime } IN Offer
RETURN { _key: OLD._key, prices_hash: OLD.prices_hash }

I have the system (edge) index on _from and _to, and a range index on expired.

The query explain output shows:

7   edge   Offer        false    false        49.51 %   [ `_from`, `_to` ]   O.`_to` == "Product/1023058135528"

The system index is used to filter on only part of the condition (_to), not on both attributes (_from, _to), and the index on expired is not used at all. Please explain the reasons for this behavior. Also, is there a way to give the optimizer a hint about which indexes to use, if I already know for sure which ones are best when planning the data model?

Answer 1:

For filter conditions combined with logical ANDs as in your query, ArangoDB's query optimizer will pick a single index. This is the reason why it hasn't picked the edge index and the skiplist index at the same time.

It will do a selection between the skiplist index on expired and the edge index on [ "_from", "_to" ], and will pick the one for which it determines the lower cost, which is measured by index selectivity estimates. As the explain output shows, it seems to have picked the edge index on _to.

The edge index internally consists of two separate hash indexes, one on the _from attribute and one on the _to attribute, so it allows quick access via both the _from and the _to attributes. However, it's not a combined index on [ "_from", "_to" ], so it does not support queries that ask for _from and _to at the same time. It has to pick one of the internal hash indexes, and seems to have picked the one on _to in that query. The decision is based on average index selectivity again.
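To make the point about the two internal hash indexes concrete, here is a minimal toy model in Python. It is not ArangoDB code: the class ToyEdgeIndex, its methods, and the sample Shop/Product edges are all hypothetical, and the `use` parameter merely stands in for the optimizer's choice. It only illustrates that one side drives the lookup and the other condition has to be applied as a post-filter.

```python
from collections import defaultdict


class ToyEdgeIndex:
    """Toy model (hypothetical): two independent hash indexes, one per endpoint."""

    def __init__(self):
        self.by_from = defaultdict(list)  # _from value -> edges
        self.by_to = defaultdict(list)    # _to value   -> edges

    def insert(self, edge):
        self.by_from[edge["_from"]].append(edge)
        self.by_to[edge["_to"]].append(edge)

    def lookup(self, from_id, to_id, use="_to"):
        """Fetch candidates via ONE hash index, then post-filter on the other attribute.

        Returns (number of candidates fetched, matching edges).
        """
        if use == "_to":
            candidates = self.by_to.get(to_id, [])
            matches = [e for e in candidates if e["_from"] == from_id]
        else:
            candidates = self.by_from.get(from_id, [])
            matches = [e for e in candidates if e["_to"] == to_id]
        return len(candidates), matches


# Hypothetical sample data, for illustration only.
edge_index = ToyEdgeIndex()
for edge in [
    {"_key": "1", "_from": "Shop/a", "_to": "Product/1"},
    {"_key": "2", "_from": "Shop/b", "_to": "Product/1"},
    {"_key": "3", "_from": "Shop/a", "_to": "Product/2"},
    {"_key": "4", "_from": "Shop/c", "_to": "Product/1"},
]:
    edge_index.insert(edge)

fetched_via_to, hits_via_to = edge_index.lookup("Shop/a", "Product/1", use="_to")
fetched_via_from, hits_via_from = edge_index.lookup("Shop/a", "Product/1", use="_from")
print(fetched_via_to, [e["_key"] for e in hits_via_to])
print(fetched_via_from, [e["_key"] for e in hits_via_from])
```

In this toy data set, either side returns the same single match, but both have to fetch and discard extra candidates first. The more edges share the same _to (or _from) value, the more post-filtering work remains.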

There is no way to provide any index usage hint to the optimizer - apart from that, it wouldn't be able to use two indexes at the same time for this particular query.

Looking at the selectivity estimate in the explain output, it seems that the edge index is not very selective, meaning there'll be lots of edges with the same _to values. As the optimizer should have also taken into account the index on _from, I would assume that index is even less selective, and that each of these indexes will only help to skip at most 50 % of the edges, which is not very much. If that's actually the case, then the query will still retrieve (and post-filter) a lot of documents, explaining potential slowness.

At the moment the attributes _from and _to are automatically indexed in an edge collection's always-present edge index, and they cannot be used in additional, user-defined indexes. This is a feature that we would like to add in a future release, because with _from and _to being accessible for user-defined indexes, one could create a combined (sorted) index on [ "_from", "_to", "expired" ] which would be potentially much more selective than any of the three single-attribute indexes in isolation.
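As a rough illustration of why such a combined sorted index would be more selective, the following Python sketch models it as a sorted list of (_from, _to, expired) tuples searched with bisect. This is only a toy model under stated assumptions: ToySortedIndex, expired_after, and the sample data are hypothetical and say nothing about how ArangoDB implements its skiplist indexes.

```python
import bisect


class ToySortedIndex:
    """Toy model (hypothetical) of a combined sorted index on (_from, _to, expired)."""

    def __init__(self, edges):
        pairs = sorted(
            ((e["_from"], e["_to"], e["expired"]), e["_key"]) for e in edges
        )
        self.keys = [key for key, _ in pairs]      # sorted index keys
        self.doc_keys = [doc for _, doc in pairs]  # document keys, same order

    def expired_after(self, from_id, to_id, threshold):
        """Return _key values with the given endpoints and expired > threshold.

        Both boundaries are found by binary search, so only matching
        entries are touched and no post-filtering is needed.
        """
        lo = bisect.bisect_right(self.keys, (from_id, to_id, threshold))
        hi = bisect.bisect_right(self.keys, (from_id, to_id, float("inf")))
        return self.doc_keys[lo:hi]


# Hypothetical sample data, for illustration only.
sorted_index = ToySortedIndex([
    {"_key": "1", "_from": "Shop/a", "_to": "Product/1", "expired": 100},
    {"_key": "2", "_from": "Shop/a", "_to": "Product/1", "expired": 200},
    {"_key": "3", "_from": "Shop/a", "_to": "Product/1", "expired": 300},
    {"_key": "4", "_from": "Shop/a", "_to": "Product/2", "expired": 400},
    {"_key": "5", "_from": "Shop/b", "_to": "Product/1", "expired": 500},
])

still_valid = sorted_index.expired_after("Shop/a", "Product/1", 150)
print(still_valid)
```

Because the equality attributes come first and the range attribute last, all matching entries sit in one contiguous slice of the sorted keys, which is what makes the whole filter condition answerable from a single index.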

Source: https://stackoverflow.com/questions/34558463/arangodb-index-usage-with-edge-collections

Tags

performance

indexing

arangodb

aql