Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange.
The problem probably is that they will try to compute the full 2D distance matrix (for n = 1 million, about 8 TB naively with double precision: 10^12 entries at 8 bytes each) and then their algorithm will run in O(n^3) time anyway.
You should seriously consider using a different clustering algorithm. Hierarchical clustering is slow and the results are usually not very convincing, in particular for millions of objects, where you can't just look at the dendrogram to choose the appropriate cut.
If you really want to continue hierarchical clustering, I believe that ELKI (Java, though) has an O(n^2) implementation of SLINK, which at 1 million objects should be approximately 1 million times as fast as the naive O(n^3) approach. I don't know if they already have CLINK, too. And I'm not sure if there actually is any sub-O(n^3) algorithm for variants other than single-link and complete-link.
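If you want to try single-link in Python first, SciPy's `linkage` exposes the same criterion. A minimal sketch on toy data; note that SciPy still materialises the condensed pairwise distances internally, so this shows what single-link does, not that it scales to 1M points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs of 50 points each.
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

# 'single' = single-link: clusters merge by the minimum
# pairwise distance between their members.
Z = linkage(X, method='single')

# Cut the dendrogram into (at most) 2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(len(set(labels)))
```

The `fastcluster` package reportedly offers O(n^2) single-link with the same call signature, but memory for the distances remains the practical limit at 1M objects.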
Consider using other algorithms. k-means, for example, scales very well with the number of objects (it's just usually not very good either, unless your data is very clean and regular). DBSCAN and OPTICS are quite good in my opinion, once you have a feel for the parameters. If your data set is low-dimensional, they can be accelerated quite well with an appropriate index structure; they should then run in O(n log n) if you have an index with O(log n) query time, which can make a huge difference for large data sets. I've personally used OPTICS on a 110k-image data set without problems, so I can imagine it scales up well to 1 million on your system.
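As a hedged sketch of the density-based route in Python: scikit-learn ships an OPTICS implementation whose neighbour queries use tree indexes for low-dimensional data. The parameter values below are illustrative for toy blobs, not recommendations:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(1)
# Two dense blobs plus a little uniform background noise.
X = np.vstack([rng.normal(0, 0.2, (100, 2)),
               rng.normal(4, 0.2, (100, 2)),
               rng.uniform(-2, 6, (10, 2))])

# min_samples is the main density knob; max_eps bounds the
# neighbourhood search so the spatial index can prune.
# cluster_method='dbscan' extracts DBSCAN-equivalent clusters
# at the given eps from the OPTICS ordering.
opt = OPTICS(min_samples=10, max_eps=1.0,
             cluster_method='dbscan', eps=0.5).fit(X)
print(set(opt.labels_) - {-1})  # cluster ids; noise is labelled -1
```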
To beat O(n^2), you'll have to first reduce your 1M points (documents) to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or ...
Two possible approaches:

- build a hierarchical tree from, say, 15k points, then add the rest one by one: time ~ 1M * tree depth
- first build 100 or 1000 flat clusters, then build your hierarchical tree of the 100 or 1000 cluster centres.
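The second approach can be sketched in a few lines. This is a rough illustration only, assuming scikit-learn's `MiniBatchKMeans` for the flat stage and SciPy for the tree; the point and cluster counts are placeholders, not tuned values:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(42)
X = rng.normal(size=(50_000, 8))  # stand-in for the real 1M points

# Stage 1: flat clustering into ~1000 piles; mini-batch k-means
# scales roughly linearly in the number of points.
km = MiniBatchKMeans(n_clusters=1000, batch_size=4096,
                     n_init=3, random_state=0).fit(X)

# Stage 2: hierarchical tree over just the 1000 centres, so the
# expensive linkage step runs on k=1000, not n=1M.
Z = linkage(km.cluster_centers_, method='average')
print(Z.shape)  # (999, 4): k-1 merge steps
```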
How well either of these might work depends critically on the size and shape of your target tree -- how many levels, how many leaves? What software are you using, and how many hours / days do you have to do the clustering?
For the flat-cluster approach, k-d trees work fine for points in 2d, 3d, 20d, even 128d -- not your case. I know hardly anything about clustering text; locality-sensitive hashing?
Take a look at scikit-learn clustering -- it has several methods, including DBSCAN.
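A minimal sketch of that scikit-learn API for DBSCAN (the `eps` / `min_samples` values are made up for toy blobs, not recommendations for real data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.2, (200, 2)),
               rng.normal(3, 0.2, (200, 2))])

# eps and min_samples are the two DBSCAN knobs; for low-dimensional
# data the neighbour queries are served by a tree index
# (algorithm='auto' picks k-d tree / ball tree when applicable).
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(len(set(db.labels_) - {-1}))  # noise points carry the label -1
```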
Added: see also

- google-all-pairs-similarity-search: "Algorithms for finding all similar pairs of vectors in sparse vector data", Bayardo et al. 2007
- SO: hierarchical-clusterization-heuristics