I have a data set as follows:
"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Stat
There is not that much you can do, except counting all pairs.
Obvious optimizations are to remove duplicate words and synonyms early, to perform stemming (anything that reduces the number of distinct tokens is good!), and to only count pairs (a,b) where a < b (in your example, count either statistics,narnia or narnia,statistics, but not both!).
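The optimizations above can be sketched in a few lines of Python. This is a minimal sketch, assuming each record is already tokenized into a list of strings (the function and variable names are mine, not from the question): `set()` removes duplicates within a record, and `sorted()` canonicalizes the order so each pair is counted only once.

```python
from itertools import combinations
from collections import Counter

def count_pairs(records):
    """Count co-occurring word pairs across records.

    set() removes duplicate tokens within a record; sorted() gives a
    canonical order so only the pair (a, b) with a < b is counted,
    never (b, a).
    """
    counts = Counter()
    for record in records:
        tokens = sorted(set(record))            # dedupe, canonical order
        counts.update(combinations(tokens, 2))  # all pairs with a < b
    return counts

data = [
    ["Statistics", "Estimation", "Narnia", "Two and half men"],
    ["I like Sheen", "Narnia", "Statistics"],
]
print(count_pairs(data)[("Narnia", "Statistics")])  # 2
```

Stemming and synonym mapping would slot in as a normalization step before the `set()` call.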
If you run out of memory, perform two passes. In the first pass, use one or more hash functions to build a candidate filter. In the second pass, count only the pairs that pass this filter (MinHash / LSH style filtering).
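Here is a sketch of the two-pass idea with a single hash function (PCY-style; a MinHash/LSH setup would use several hashes, and all names and the `width`/`threshold` parameters are illustrative assumptions). Pass 1 counts pair hashes in a fixed-size array, so memory stays bounded; pass 2 keeps exact counts only for pairs whose bucket met the threshold. Collisions can admit false candidates, but a frequent pair is never dropped.

```python
from itertools import combinations
from collections import Counter

def two_pass_count(records, width=1_000_003, threshold=2):
    """Pair counting in bounded memory via a hashed candidate filter.

    Pass 1: approximate counts in `width` hash buckets.
    Pass 2: exact counts, restricted to pairs whose bucket count
    reached `threshold` in pass 1.
    """
    buckets = [0] * width
    for record in records:
        for pair in combinations(sorted(set(record)), 2):
            buckets[hash(pair) % width] += 1

    counts = Counter()
    for record in records:
        for pair in combinations(sorted(set(record)), 2):
            if buckets[hash(pair) % width] >= threshold:
                counts[pair] += 1
    return counts

data = [
    ["Statistics", "Estimation", "Narnia", "Two and half men"],
    ["I like Sheen", "Narnia", "Statistics"],
]
```

With `threshold=2`, pairs that occur only once never enter the second-pass counter, which is where the memory saving comes from.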
It's an embarrassingly parallel problem, so it is also easy to distribute across multiple threads or computers.
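The distribution pattern is plain map-and-merge: count pairs per chunk, then sum the counters. A minimal sketch with threads (names are mine; for CPU-bound counting in Python you would use processes or separate machines, but the merge pattern is identical):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def count_chunk(records):
    """Exact pair counts for one chunk of records."""
    counts = Counter()
    for record in records:
        counts.update(combinations(sorted(set(record)), 2))
    return counts

def parallel_count(records, workers=4):
    """Map each chunk to a worker, then merge the partial counters."""
    chunks = [records[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_chunk, chunks)
    total = Counter()
    for partial in partials:
        total += partial
    return total

data = [
    ["Statistics", "Estimation", "Narnia", "Two and half men"],
    ["I like Sheen", "Narnia", "Statistics"],
]
```

Because `Counter` addition merges by key, no coordination between workers is needed beyond the final reduce.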