ArangoDB: performance index in array element

Question

I have a collection in ArangoDB populated with documents like this:

{
  "id": "XXXXXXXX",
  "relation": [
    {
      "AAAAA": "AAAAA"
    },
    {
      "BBBB": "BBBBBB",
      "field": {
        "v1": 0,
        "v2": 0,
        "v3": 0
      }
    },
    {
      "CCCC": "CCCC",
      "field": {
        "v1": 0,
        "v2": 1,
        "v3": 2
      }
    }
  ]
}

I want to return only the documents that have field.v1 > 0 (or some combination of the v values). I've tried the AQL query below, but it doesn't use any index and it is very slow with 200,000+ documents.

FOR a IN X
    FILTER LENGTH(a.relation) > 0
    LET relation = a.relation
    FOR r IN relation
        FILTER r.field > null
        FILTER r.field.v1 > 0
        RETURN a

I've tried creating these indexes, but with no result:

full text on relation[*]field
skip list on relation[*]field
hash on relation[*]field

What can I do? Can you suggest any changes to the query?

Thanks.

Best regards,

Daniele

Answer 1:

I suggest the following changes, but they won't speed up the query noticeably:

The filters FILTER r.field > null and FILTER r.field.v1 > 0 are redundant. You can keep only the latter, FILTER r.field.v1 > 0, and drop the other condition.
The auxiliary variable LET relation = a.relation is defined after a.relation has already been used in the LENGTH(a.relation) calculation. If the variable were defined before the LENGTH() calculation, it could be used inside it, like this: LET relation = a.relation FILTER LENGTH(relation) > 0. This will save a bit of processing time.
The original query checks each v1 value and may return a document multiple times if several v1 values in it satisfy the filter condition. That means the original query may return more documents than are actually present in the collection. If that's not desired, I suggest using a subquery (see below).

When applying the above modifications to the original query, this is what I came up with:

FOR a IN X 
  LET relation = a.relation
  FILTER LENGTH(relation) > 0 
  LET s = (
    FOR r IN relation
      FILTER r.field.v1 > 0 
      LIMIT 1 
      RETURN 1
  )
  FILTER LENGTH(s) > 0 
  RETURN a

As I said, this probably won't improve performance greatly. However, you may get a different (potentially the desired) result from the query, i.e. fewer documents if multiple v1 values in a document satisfy the filter condition.
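
To make the difference in result counts concrete, here is a minimal sketch in plain Python (a toy model, not ArangoDB code). The sample documents and the helper names v1_of, nested_loop and subquery_limit_1 are made up for illustration, and the sketch assumes that a missing attribute behaves like null and never satisfies > 0.

```python
# Toy model of the two query shapes; plain Python, not ArangoDB code.
# The documents below are hypothetical and only illustrate the result counts.
docs = [
    {"id": "d1", "relation": [{"field": {"v1": 1}}, {"field": {"v1": 2}}]},
    {"id": "d2", "relation": [{"AAAAA": "AAAAA"}, {"field": {"v1": 0}}]},
    {"id": "d3", "relation": []},
]


def v1_of(entry):
    """Return entry.field.v1, or None when an attribute is missing."""
    return entry.get("field", {}).get("v1")


def nested_loop(collection):
    """Original shape: one result row per matching array entry."""
    return [
        a["id"]
        for a in collection
        for r in a["relation"]
        if v1_of(r) is not None and v1_of(r) > 0
    ]


def subquery_limit_1(collection):
    """Rewritten shape: stop at the first match, one row per document."""
    return [
        a["id"]
        for a in collection
        if any(v1_of(r) is not None and v1_of(r) > 0 for r in a["relation"])
    ]


print(nested_loop(docs))       # ['d1', 'd1']
print(subquery_limit_1(docs))  # ['d1']
```

The any() call stops at the first matching entry, which plays the role of LIMIT 1 in the subquery.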

Regarding indexes: fulltext and hash indexes will not help here, as they support only equality comparisons, whereas the query's filter condition is a greater-than comparison. The only index type that could be beneficial here in general is the skiplist index. However, indexing array values is not supported in 2.7 at all, so indexing relation[*].field won't help, and no index will be used, as you reported.

ArangoDB 2.8 will be the first version that supports indexing individual array values, and there you could create an index on relation[*].field.v1.

Still, the query in 2.8 won't use that index, because array indexes are only used for the IN comparison operator. They cannot be used with > as in the query. Additionally, when writing the filter condition as FILTER a.relation[*].field.v1 > 0, this would evaluate to FILTER [null, 0, 0] > 0 for the example document above, which will not produce the desired results.
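
The [null, 0, 0] value can be reproduced with a small Python sketch of the [*] expansion. This is a toy model rather than ArangoDB code: the helper name expand is hypothetical, and None stands in for AQL's null.

```python
# Toy model of the [*] expansion on the example document; not ArangoDB code.
doc = {
    "id": "XXXXXXXX",
    "relation": [
        {"AAAAA": "AAAAA"},
        {"BBBB": "BBBBBB", "field": {"v1": 0, "v2": 0, "v3": 0}},
        {"CCCC": "CCCC", "field": {"v1": 0, "v2": 1, "v3": 2}},
    ],
}


def expand(entries, *path):
    """Follow `path` inside every entry; yield None where an attribute is missing."""
    out = []
    for entry in entries:
        value = entry
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        out.append(value)
    return out


v1_values = expand(doc["relation"], "field", "v1")
print(v1_values)  # [None, 0, 0]
```

The filter then compares this whole list with 0 as a single value, instead of comparing each member with 0, which is why the condition does not express the intended check.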

What could help here is a comparison operator modifier (working title) that tells the operators <, <=, >, >=, ==, != to run the comparison on all members of the left operand. There could be ALL and ANY modifiers, so that the filter condition could be written simply as FILTER a.relation[*].field.v1 ANY > 0. Please note that this is not an existing feature yet, only my quick draft of how this could be fixed in the future.
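
As a rough illustration of what such modifiers would mean, here is a Python sketch of the drafted ANY and ALL semantics for the > operator. The function names any_greater and all_greater are hypothetical, and the treatment of missing values (None never matches) is an assumption, since the draft above does not specify it.

```python
# Toy model of the drafted ANY / ALL modifiers for `>`; not an ArangoDB API.
# Assumption: a missing value (None) never satisfies the comparison.

def any_greater(values, threshold):
    """Model of `values ANY > threshold`: true if at least one member matches."""
    return any(v is not None and v > threshold for v in values)


def all_greater(values, threshold):
    """Model of `values ALL > threshold`: true only if every member matches."""
    return all(v is not None and v > threshold for v in values)


print(any_greater([None, 0, 0], 0))  # False
print(any_greater([None, 0, 2], 0))  # True
print(all_greater([1, 2], 0))        # True
print(all_greater([None, 1], 0))     # False
```

With the example document's v1 values [null, 0, 0], the ANY form would be false, so the document would be filtered out, which is the result the question asks for.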

Answer 2:

Fulltext indexes can currently only be used with the FULLTEXT() function.

It's currently not possible to use indexes for determining the length of sub-objects. This is something that could be solved with function-defined indexes, once they become a reality.

Right now, the only way to get usable performance for this would be to store that length in another attribute when writing the documents into the collection:

{
  "id": "XXXXXXXX",
  "length": 6,
  "relation": [
    {
      "AAAAA": "AAAAA"
    },
    {
      "BBBB": "BBBBBB",
      "field": {
        "v1": 0,
        "v2": 0,
        "v3": 0
      }
    },
    {
      "CCCC": "CCCC",
      "field": {
        "v1": 0,
        "v2": 1,
        "v3": 2
      }
    }
  ]
}
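
A minimal Python sketch of that write-time step is shown below. It is only an illustration: the helper name with_length is hypothetical, and because the answer does not say exactly what "length" counts, the sketch assumes it is the number of entries in relation. The actual insert into the collection is left out.

```python
import copy

# Sketch of precomputing a derived attribute before a document is stored.
# Assumption: "length" means the number of entries in "relation".


def with_length(doc):
    """Return a copy of `doc` with a derived "length" attribute added."""
    enriched = copy.deepcopy(doc)
    enriched["length"] = len(enriched.get("relation", []))
    return enriched


doc = {
    "id": "XXXXXXXX",
    "relation": [
        {"AAAAA": "AAAAA"},
        {"BBBB": "BBBBBB", "field": {"v1": 0, "v2": 0, "v3": 0}},
        {"CCCC": "CCCC", "field": {"v1": 0, "v2": 1, "v3": 2}},
    ],
}

stored = with_length(doc)
print(stored["length"])  # 3
```

Because the value is a plain top-level attribute, it can be indexed and filtered on directly, at the cost of keeping it in sync whenever relation changes.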

<Clippy>you look like you want to be using graph features for your data layout? </Clippy>

Source: https://stackoverflow.com/questions/33921296/arangodb-performance-index-in-array-element

Tags

performance

indexing

arangodb