Tuning Neo4j for Performance

感情败类 2021-01-31 22:06

I have imported data using Michael Hunger's Batch Import, through which I created:

4,612,893 nodes
14,495,063 properties (node properties are indexed)
5,300


        
3 Answers
  • 2021-01-31 22:19

    You don't need:

    WHERE NOT(a=b)
    

    Two different identifiers are never bound to the same node by the pattern matcher.

    Can you use profile with your queries?

    profile START u=node(467242)
    MATCH u-[r1:LIKE|COMMENT]->a<-[r2:LIKE|COMMENT]-lu-[r3:LIKE]-b
    RETURN u,COUNT(b)
    

    It would also be interesting to see how many nodes are touched:

    profile START u=node(467242)
    MATCH u-[r1:LIKE|COMMENT]->a<-[r2:LIKE|COMMENT]-lu-[r3:LIKE]-b
    RETURN count(distinct a),COUNT(distinct b),COUNT(*)
    

    You can also reduce your memory-mapped I/O (MMIO) settings to the real values, i.e. the actual on-disk sizes of the store files:

    neostore.nodestore.db.mapped_memory=180M
    neostore.relationshipstore.db.mapped_memory=750M
    

    If you declare all of your machine's RAM as heap, it will compete with the filesystem cache and the memory-mapped I/O buffers.

    wrapper.java.initmemory=5000
    wrapper.java.maxmemory=5000
    

    Are you measuring the first run or subsequent runs of your queries?

  • 2021-01-31 22:21

    I ran this on my MacBook Air, which has little RAM and CPU, against your dataset.

    You will get much faster results than mine with more memory mapping, the GCR cache, and more heap for caches. Also make sure to use parameters in your queries.
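    As a minimal sketch (the parameter name userId and the relationship types are illustrations, not taken from the question), the start node id could be passed as a parameter instead of being hard-coded:

    START u=node({userId})
    MATCH u-[:LIKED|COMMENTED]->a<-[:LIKED|COMMENTED]-lu-[:LIKED]-b
    RETURN u, count(b);
    

    That way the execution plan can be cached and reused across users instead of being re-parsed for every literal node id.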

    You are running into combinatorial explosion.

    Every step of the path multiplies the number of matched rows by the number of relationships traversed at that step.

    See for instance below: you end up with 269,268 matches but only 81,674 distinct lu nodes.

    The problem is that the next MATCH is expanded for each existing row. So if you use DISTINCT in between to limit the intermediate sizes, you get some orders of magnitude less data. The same applies at the next level.

    START u=node(467242)
    MATCH u-[:LIKED|COMMENTED]->a
    WITH distinct a
    MATCH a<-[r2:LIKED|COMMENTED]-lu
    RETURN count(*),count(distinct a),count(distinct lu);
    
    +---------------------------------------------------+
    | count(*) | count(distinct a) | count(distinct lu) |
    +---------------------------------------------------+
    | 269268   | 1952              | 81674              |
    +---------------------------------------------------+
    1 row
    
    895 ms
    
    START u=node(467242)
    MATCH u-[:LIKED|COMMENTED]->a
    WITH distinct a
    MATCH a<-[:LIKED|COMMENTED]-lu
    WITH distinct lu
    MATCH lu-[:LIKED]-b
    RETURN count(*),count(distinct lu), count(distinct b)
    ;
    +---------------------------------------------------+
    | count(*) | count(distinct lu) | count(distinct b) |
    +---------------------------------------------------+
    | 2311694  | 62705              | 91294             |
    +---------------------------------------------------+
    

    Here you have 2.3M total matches but only 91k distinct elements, so almost two orders of magnitude more rows than distinct results.
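    Putting the same idea back into the original query, a rough sketch (assuming the LIKED/COMMENTED/LIKED relationship types used above) that de-duplicates between every expansion could look like this, counting distinct b nodes instead of raw match rows:

    START u=node(467242)
    MATCH u-[:LIKED|COMMENTED]->a
    WITH distinct u, a
    MATCH a<-[:LIKED|COMMENTED]-lu
    WITH distinct u, lu
    MATCH lu-[:LIKED]-b
    RETURN u, count(distinct b);
    

    Each WITH DISTINCT collapses the intermediate rows before the next expansion, which is where the orders of magnitude are saved.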

    This is a huge aggregation, which is more of a BI / statistics query than an OLTP query. Usually you can store the results, e.g. on the user node, and only re-execute this in the background.
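    As a hedged sketch of that precompute-and-store idea (the property name reach_count is an illustrative assumption), a background job could write the aggregate onto the user node, and the live request would then only read u.reach_count:

    START u=node(467242)
    MATCH u-[:LIKED|COMMENTED]->a<-[:LIKED|COMMENTED]-lu-[:LIKED]-b
    WITH u, count(distinct b) AS cnt
    SET u.reach_count = cnt;
    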

    These kinds of queries are again global graph queries (statistics / BI), in this case the top 10 users.

    Usually you would run these in the background (e.g. once per day or hour) and connect the top 10 user nodes to a special node or index that then can be queried in a few ms.

    START a=node:nodes(kind="user") RETURN count(*);
    +----------+
    | count(*) |
    +----------+
    | 3889031  |
    +----------+
    1 row
    
    27329 ms
    

    After all, you are running a match across the whole graph, i.e. 4M users; that's a graph-global, not a graph-local, query.

    START n=node:nodes(kind="top-user")
    MATCH n-[r?:TOP_USER]-()
    DELETE r
    WITH distinct n
    START a=node:nodes(kind="user")
    MATCH a-[:CREATED|LIKED|COMMENTED|FOLLOWS]-()
    WITH n, a,count(*) as cnt
    ORDER BY cnt DESC
    LIMIT 10
    CREATE a-[:TOP_USER {count:cnt} ]->n;
    
    +-------------------+
    | No data returned. |
    +-------------------+
    Relationships created: 10
    Properties set: 10
    Relationships deleted: 10
    
    70316 ms
    

    The querying would then be:

    START n=node:nodes(kind="top-user")
    MATCH n-[r:TOP_USER]-a
    RETURN a, r.count
    ORDER BY r.count DESC;
    
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | a                                                                                                                                                  | r.count |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
    ….
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
    10 rows
    
    4 ms
    
  • 2021-01-31 22:26

    Okay, so first of all, that's a very large graph for only 8 GB of memory. You should seriously consider getting a larger box. Neo4j actually provides an extremely nice hardware calculator that will let you determine exactly what is appropriate for your needs:

    http://neotechnology.com/calculatorv2/
    

    Roughly speaking (since there are more relevant metrics for determining size), their calculator estimates you should be dedicating about 10 GB at a minimum.

    Secondly, Neo4j, like any graph database, will have issues with nodes that have a large number of connections. If you're looking to tune your instance to perform better (after getting a bigger box), I would suggest looking for any massive nodes with a large number of connections, as those will seriously impact performance.

    After seeing your examples, I'm quite certain you've got a graph with a number of nodes that have a much larger number of connections than other nodes. This will inherently slow down your performance. You might also try narrower queries. Especially when you're already working on a server that's too small, you don't want to run the kind of extremely taxing, large-return queries you've got there.

    There are some things about your queries that could be cleaned up, but I really urge you to get the appropriately sized box for your graph and actually do some introspection into the number of connections your most connected nodes have.
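    A minimal sketch of that introspection, in the same old-style Cypher used above (node(*) scans the whole graph, so treat this as an offline job; the LIMIT of 10 is an arbitrary choice):

    // find the most heavily connected nodes and their degree
    START n=node(*)
    MATCH n-[r]-()
    RETURN n, count(r) AS degree
    ORDER BY degree DESC
    LIMIT 10;
    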

    It also looks like you have an artificial cap on your Java Heap size. If you try starting up java with a command like:

    java -Xmx8g //Other stuff
    

    You'll allocate 8 GB instead of the default ~500 MB, which would also help.
