Hierarchical clustering of 1 million objects

轮回少年 2021-01-30 14:29
Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange.

      
      
        
[愿得一人] (original poster) 2021-01-30 15:27
              

            
            
                        
To beat O(n^2), you'll have to first reduce your 1M points (documents)
to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or ...
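
A quick back-of-the-envelope check of why O(n^2) is the obstacle: the usual agglomerative methods start from all pairwise distances. The sketch below assumes 8-byte floats and a condensed (upper-triangle) distance matrix; both are illustrative assumptions, and `condensed_distance_bytes` is a made-up helper name, not part of any library.

```python
def condensed_distance_bytes(n_points, bytes_per_value=8):
    """Bytes needed to store the n*(n-1)/2 pairwise distances of n_points objects."""
    n_pairs = n_points * (n_points - 1) // 2
    return n_pairs * bytes_per_value


# Compare one pile of 1000 points, a pile of 10k, and the full 1M.
for n in (1_000, 10_000, 1_000_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>9,} points -> {gib:,.3f} GiB of distances")
```

Under these assumptions a pile of 1000 points needs about 4 MB of distances, while all 1M points at once would need about 4 TB, which is why the points have to be split into piles first.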

Two possible approaches:


1. Build a hierarchical tree from, say, 15k points, then add the rest one by one:
   time ~ 1M * treedepth.
2. First build 100 or 1000 flat clusters,
   then build your hierarchical tree of the 100 or 1000 cluster centres.
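
The second approach (flat clusters first, then a tree over the cluster centres) could be sketched roughly as follows. This is only one possible reading of it: it assumes the objects are already dense numeric vectors, uses scikit-learn's `MiniBatchKMeans` for the flat step and SciPy's `linkage` with Ward linkage for the tree, and the synthetic data, pile count and function name `two_stage_hierarchy` are placeholders chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import MiniBatchKMeans


def two_stage_hierarchy(points, n_piles=100, seed=0):
    """Flat-cluster the points into piles, then build a tree over the pile centres."""
    km = MiniBatchKMeans(n_clusters=n_piles, random_state=seed, n_init=3)
    pile_of_point = km.fit_predict(points)   # pile index for every point
    centres = km.cluster_centers_            # shape (n_piles, n_features)
    tree = linkage(centres, method="ward")   # pairwise work only on n_piles centres
    return pile_of_point, tree


rng = np.random.default_rng(0)
points = rng.normal(size=(20_000, 8))        # synthetic stand-in for real data
pile_of_point, tree = two_stage_hierarchy(points, n_piles=50)

# Cut the centre tree into at most 5 groups and push the labels back to every point.
group_of_pile = fcluster(tree, t=5, criterion="maxclust")   # labels start at 1
group_of_point = group_of_pile[pile_of_point]
print(tree.shape, group_of_point.shape)
```

The expensive pairwise step only ever sees the pile centres, so its cost depends on the number of piles rather than on the 1M points. The price is that the tree says nothing about structure inside a pile; each pile would need its own tree if finer levels matter.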


How well either of these might work depends critically
on the size and shape of your target tree:
how many levels, how many leaves?

What software are you using,
and how many hours / days do you have to do the clustering?

For the flat-cluster approach,
k-d trees
work fine for points in 2d, 3d, 20d, even 128d -- not your case.
I know hardly anything about clustering text;
perhaps locality-sensitive hashing?
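
To make the locality-sensitive hashing idea concrete for text, here is a minimal MinHash sketch using only the standard library. It is an illustration under assumptions, not a recommendation of specific settings: the 3-word shingles, 32 hash functions and 8 bands are arbitrary, and all function names are invented for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Set of k-word shingles for one document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def salted_hash(seed, item):
    """Deterministic 64-bit hash of a string, salted by an integer seed."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, n_hashes=32):
    """Keep the minimum hash value per salted hash function."""
    return tuple(min(salted_hash(seed, s) for s in shingle_set)
                 for seed in range(n_hashes))


def candidate_pairs(docs, n_hashes=32, bands=8):
    """Index pairs of documents that share at least one LSH band bucket."""
    rows = n_hashes // bands
    buckets = defaultdict(set)
    for idx, text in enumerate(docs):
        sig = minhash_signature(shingles(text), n_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "hierarchical clustering of one million documents",
]
print(candidate_pairs(docs))
```

Documents with a high shingle overlap are likely, but not guaranteed, to collide in some band, so the buckets can serve as cheap piles or as a candidate list for exact similarity checks without comparing all pairs.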

Take a look at scikit-learn clustering --
it has several methods, including DBSCAN.
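
For orientation, a minimal DBSCAN call in scikit-learn looks like this. The two synthetic blobs and the `eps` / `min_samples` values are assumptions picked so the toy example separates cleanly; real data would need its own tuning. Note that DBSCAN gives a flat clustering, not a tree.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight, well-separated synthetic blobs in 2d.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Label -1 marks noise points; every other label is a cluster id.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)
n_clusters = len(set(labels) - {-1})
print(n_clusters, "clusters,", int((labels == -1).sum()), "noise points")
```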

Added: see also

google-all-pairs-similarity-search:
"Algorithms for finding all similar pairs of vectors in sparse vector data", Bayardo et al. 2007

SO: hierarchical-clusterization-heuristics
    
             
                                                        
            
            
              
                