Neo4j Cypher query structure and performance optimization

后端 未结 1 1395
一向
一向 2020-12-21 21:02

I have created a Cypher query dynamic builder. For a complex cases this builder produces a quite big queries, for example:

MATCH (parentD)-[:CONTAINS]->(c         


        
相关标签:
1条回答
  • 2020-12-21 21:49

    The main problem with your query, is that you are basically doing a lot of checks, with rows running wild. So here are some tips to reduce how many rows you are generating at each MATCH.

    1) Unless you NEED duplicates, use WITH DISTINCT instead of just WITH. WITH can create duplicate rows (because you only cut off a column), and every duplicate row you process is wasted time and extra DB hits. (Namely, every filter column you drop adds duplicate rows)

    2) :Value.value is overloaded. It has no semantic meaning, and the value isn't even guaranteed to be any kind of type. That means every :Value check has to go out and touch a bunch of :Value nodes that have nothing to do with what your searching for. So as the number of attached :Value nodes increases, the more expensive it becomes to find the right one (This is less expensive if it could be indexed, so that it could just find the right :Value, and see what it is connected to. This doesn't help if you can't change the schema you're working with, and by schema, I mean how your data/relationships are setup).

    3) Only check what you need to check. It might seem more efficient to say (a:A)-[:TO]->(b:B), but if all [:TO] are from :A to :B, Neo4j now has to verify that the first node is an :A and the second node is a :B. Cypher doesn't know what is implicitly true, so it has to do the check, but each of these redundant checks has to go out and hit the DB for every row. So it is better to say (a)-[:TO]->(b).

    4) Limit variable scope. Here, you match -[ru:CREATED_BY]->(u:User) at the beginning but than don't use it til the end, with no filters. This multiplies how many rows you have by the number of -[ru:CREATED_BY]->(u:User) on each decision, that ALL have to be checked in the further matches. Unless -[ru:CREATED_BY]->(u:User) somehow greatly limits the matched decisions (or there can only be one per decision), match this support information at the end.

    5) Order your filters from strongest to weakest (if you can). to cut as many rows as early as possible.

    6) Tricks to minimize rows. Each row pulled up makes the following steps in the query have to work that much harder, so minimize rows in queries. If you are using OR to combine unrelated, but similar columns queries (like all orgs with conditions A or orgs with conditions B) and the work of the two queries just make things more expensive for the other half, it might be better to use UNION to combine the results of smaller, faster queries (and UNION can run in parallel up to the merge results). Note that simple queries like WHERE org.id in [1,2,3] are still faster than UNION, since the work can all be done in one lookup.

    Aside from union, if you are collecting nodes that you don't filter on, you can use collect(column) to reduce 'duplicates' down to 1 row, and than UNWIND (column) as column at the end of the query to get your rows back! (column here referring to variable name)

    7) Doing a lot of filters on 1 node? Cypher has USING hints for that! The hint USING JOIN ON column tells Cypher that it will probably be more efficient doing this match with more starting leafs and joining them. So using USING JOIN ON childD on each match will tell Cypher to do all the filters in parallel, and use the overlapping rows of all of them. Note that USINGs are just you telling Cypher "trust me, this should go faster if we try doing this" which can actually make the query worse if you are wrong. (USING JOIN should be useful though for making large queries more parallel though)

    UPDATE:
    First, a note on node.id = "constant" AND node.value = "constant" OR node.id = "constant2" AND node.value = "constant2" vs node.value = map[node.id]. The first query is able to do node filtering on node lookup, while the later has to filter through all of the nodes that where already looked up. Without previous filtering on that lookup, that means the map has to pull in all nodes. While the map offers some level of (arguable) simplicity/flexibility, it is one of the least efficient ways to filter nodes.

    Second, The big problem with your query now, is the :Value is super overloaded, and you aren't finding it by ID. :Value should be a relationship, or have an indexed ID field so that you don't have to touch ALL <-[:SET_FOR]- and -[:SET_ON]->. Using the Join hint I think will at least make SET_FOR higher priority, which appears to be the more efficient of the two.

    Here is my attempt to rewrite the PROFILE query more efficiently. (v1)

    MATCH (parentD)-[:CONTAINS]->(childD:Decision)
    WHERE id(parentD) = 415406 
    MATCH (childD)<-[:SET_FOR]-(filterValue416423)-[:SET_ON]->(filterCharacteristic416423)
    USING JOIN ON childD
    WHERE id(filterCharacteristic416423) = 416423
    WHERE ('Adobe RGB' IN filterValue416423.value ) OR ('ECI RGB' IN filterValue416423.value )
    WITH DISTINCT childD
    MATCH (childD)<-[:SET_FOR]-(filterValue416273)-[:SET_ON]->(filterCharacteristic416273) 
    USING JOIN ON childD
    WHERE id(filterCharacteristic416273) = 416273 AND (filterValue416273.value >= 4) AND (filterValue416273.value <= 53)
    WITH DISTINCT childD  
    MATCH (childD)<-[:SET_FOR]-(filterValue415431)-[:SET_ON]->(filterCharacteristic415431) 
    USING JOIN ON childD
    WHERE id(filterCharacteristic415431) = 415431
    WHERE ('Compact' IN filterValue415431.value ) 
      OR ('Compact SLR' IN filterValue415431.value ) 
      OR ('Large SLR' IN filterValue415431.value ) 
      OR ('Rangefinder-style mirrorless' IN filterValue415431.value ) 
      OR ('SLR-like (bridge)' IN filterValue415431.value )
    WITH DISTINCT childD
    MATCH (childD)<-[:SET_FOR]-(filterValue415441)-[:SET_ON]->(filterCharacteristic415441) 
    USING JOIN ON childD
    WHERE id(filterCharacteristic415441) = 415441
    WHERE  ('Brass' IN filterValue415441.value ) 
      OR ('Carbon fiber' IN filterValue415441.value )  
    
    OPTIONAL MATCH (childD)<-[:VOTED_FOR]-(vg:VoteGroup)-[:VOTED_ON]->(c:Criterion) 
    WHERE id(c) IN [415414, 415415, 415412, 415426, 415411]  
    
    WITH DISTINCT * MATCH (childD)-[ru:CREATED_BY]->(u:User)  
    WITH DISTINCT childD, ru, u, vg.avgVotesWeight as weight, vg.totalVotes as totalVotes 
    WITH DISTINCT ru, u, childD , toFloat(sum(weight)) as weight, toInt(sum(totalVotes)) as totalVotes  
    ORDER BY  weight DESC 
    SKIP 0 LIMIT 10 
    RETURN ru, u, childD AS decision, weight, totalVotes, 
    [ (parentD)<-[:DEFINED_BY]-(entity)<-[:COMMENTED_ON]-(comg:CommentGroup)-[:COMMENTED_FOR]->(childD) | 
      {entityId: id(entity),  types: labels(entity), totalComments: toInt(comg.totalComments)} ] AS commentGroups, 
    [ (parentD)<-[:DEFINED_BY]-(c1:Criterion)<-[:VOTED_ON]-(vg1:VoteGroup)-[:VOTED_FOR]->(childD) | 
      {criterionId: id(c1),  weight: vg1.avgVotesWeight, totalVotes: toInt(vg1.totalVotes)} ] AS weightedCriteria, 
    [ (parentD)<-[:DEFINED_BY]-(ch1:Characteristic)<-[:SET_ON]-(v1)-[:SET_FOR]->(childD) | 
      {characteristicId: id(ch1),  value: v1.value, valueType: ch1.valueType, visualMode: ch1.visualMode} ] AS valuedCharacteristics
    
    0 讨论(0)
提交回复
热议问题