I currently want to calculate all-pairs document similarity using cosine similarity and tf-idf features in Python. My basic approach is the following (a minimal sketch; documents stands in for the actual corpus and the vectorizer is left at its defaults):
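```python
from sklearn.feature_extraction.text import TfidfVectorizer

# documents: a list of ~350,363 raw text strings (placeholder name)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # sparse tf-idf matrix, one row per document

# TfidfVectorizer L2-normalizes each row by default, so the dot
# product of two rows equals their cosine similarity.
similarity = X * X.T
```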
Even though X is sparse, X * X.T probably won't be: note that it only takes one common nonzero column in a given pair of rows to produce a nonzero entry. You are working on an NLP task, so there are almost certainly many words which occur in nearly all documents (and, as said before, it does not have to be one word shared across all pairs, just one, possibly different, word for each pair). As a result you get a matrix of 350363^2, about 122,000,000,000 elements; at 8 bytes per float64 entry that is roughly 980 GB of RAM, so it does not look computable as a dense matrix. Try to perform much more aggressive filtering of words in order to force X * X.T to be sparse (remove many common words).
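One way to do that filtering is through TfidfVectorizer's built-in document-frequency cutoffs; a minimal sketch, where the max_df and min_df values are placeholders that would need tuning on the actual corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# max_df drops words appearing in more than 5% of documents,
# min_df drops words appearing in fewer than 10 documents;
# both cutoffs are illustrative and need tuning.
vectorizer = TfidfVectorizer(max_df=0.05, min_df=10, stop_words='english')
X = vectorizer.fit_transform(documents)
```

The fewer columns two rows share, the fewer nonzero entries X * X.T gets, so aggressive max_df values translate directly into a sparser Gram matrix.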
In general you won't be able to compute the Gram matrix of big data unless you enforce sparsity of X * X.T, so that most of your pairs of vectors (documents) have 0 "similarity". It can be done in numerous ways; the easiest is to set some threshold T, compute the dot products yourself, and create an entry in the resulting sparse matrix only if the value is greater than T (everything at or below T is treated as 0).
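A sketch of that idea, using a hypothetical helper that multiplies row blocks of X against X.T so only a small dense slab is in memory at once (T and block_size are tuning parameters, not values from the question):

```python
import numpy as np
import scipy.sparse as sp

def thresholded_similarity(X, T=0.8, block_size=500):
    """Compute X * X.T block by block, keeping only entries greater
    than T in a sparse result. T and block_size need tuning."""
    n = X.shape[0]
    rows, cols, vals = [], [], []
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        # Dense slab of shape (block_size, n); pick block_size so it fits in RAM.
        block = (X[start:end] * X.T).toarray()
        r, c = np.nonzero(block > T)
        rows.extend(r + start)
        cols.extend(c)
        vals.extend(block[r, c])
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

similarity = thresholded_similarity(X, T=0.8)
```

This trades one huge allocation for many small ones, and the result stays sparse as long as T is high enough that most document pairs fall below it.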