How to efficiently calculate huge matrix multiplication (tfidf features) in Python?


感情败类 2021-02-01 11:12

I currently want to compute all-pairs document similarity using cosine similarity and TF-IDF features in Python. My basic approach is the following:

from sklearn


      
      
        
3 answers

南笙 (original poster) 2021-02-01 12:01

What you could do is take one row of X at a time, multiply it against the columns of the second operand (for all-pairs similarity that is X.T), and save the resulting row of the product to a file. Then move on to the next row.

It is still the same amount of calculation work but you wouldn't run out of memory.
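A minimal sketch of the row-by-row idea, processing a block of rows per step instead of a single row to cut loop overhead. It assumes X is a dense NumPy array whose rows are already L2-normalized, so the dot product equals the cosine similarity. The function name `similarity_to_disk`, the block size, and the output file name are illustrative choices, not something from the question.

```python
import os
import tempfile

import numpy as np


def similarity_to_disk(X, out_path, block_rows=256):
    """Write X @ X.T to a disk-backed .npy file, one block of rows at a time."""
    n = X.shape[0]
    # The result lives on disk; only one block of result rows is in RAM at once.
    out = np.lib.format.open_memmap(
        out_path, mode="w+", dtype=X.dtype, shape=(n, n)
    )
    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        out[start:stop] = X[start:stop] @ X.T
    out.flush()
    del out  # release the mapping before anyone else opens the file
    return out_path


# Tiny demo: random data stands in for the TF-IDF matrix (an assumption).
rng = np.random.default_rng(0)
X_demo = rng.random((200, 20))
X_demo /= np.linalg.norm(X_demo, axis=1, keepdims=True)  # unit-length rows

demo_path = os.path.join(tempfile.mkdtemp(), "similarity.npy")
similarity_to_disk(X_demo, demo_path, block_rows=64)

S = np.load(demo_path, mmap_mode="r")  # read back lazily, without loading it all
print(S.shape)
```

A larger `block_rows` means fewer, bigger multiplications but more memory per step, so it is worth tuning to the RAM available.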

Using multiprocessing.Pool.map() or multiprocessing.Pool.map_async() you might be able to speed it up, provided you use numpy.memmap() to read the matrix in the mapped function. You would probably have to write each of the calculated rows to a separate file and merge them later. If you returned the row from the mapped function, it would have to be transferred back to the original process, which would take a lot of memory and IPC bandwidth.
    
             
                                                        
            
            
              
                