Faster way to remove stop words in Python

情歌与酒 · 2020-12-04 09:34

I am trying to remove stopwords from a string of text:

    from nltk.corpus import stopwords

    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split()
                     if word not in stopwords.words('english')])

I'm processing 6 million such strings, so speed is important.


        
4 Answers
  • 2020-12-04 09:49

    Use a regexp that matches the stop words and remove them:

        import re
        from nltk.corpus import stopwords

        # Match any stop word as a whole word, plus the whitespace after it.
        pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
        text = pattern.sub('', text)
    

    This will probably be way faster than looping yourself, especially for large input strings.

    If the last word in the text gets deleted by this, you may have trailing whitespace. I propose to handle this separately.
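
    One minimal way to handle that trailing whitespace, assuming the same pattern object as above, is to strip the result:

        # Remove leftover whitespace at either end after substitution.
        text = pattern.sub('', text).strip()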

  • 2020-12-04 09:52

    Try caching the stopwords list, as shown below; constructing it each time you call the function seems to be the bottleneck.

        from nltk.corpus import stopwords

        # Build the stop word list once, at module load.
        cachedStopWords = stopwords.words("english")

        def testFuncOld():
            # Rebuilds the stop word list on every call.
            text = 'hello bye the the hi'
            text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

        def testFuncNew():
            # Reuses the cached list.
            text = 'hello bye the the hi'
            text = ' '.join([word for word in text.split() if word not in cachedStopWords])

        if __name__ == "__main__":
            for i in range(10000):
                testFuncOld()
                testFuncNew()

    I ran this through the profiler with python -m cProfile -s cumulative words.py. The relevant lines are posted below.

        ncalls  cumtime  filename:lineno(function)
         10000    7.723  words.py:7(testFuncOld)
         10000    0.140  words.py:11(testFuncNew)

    So caching the stopwords list gives a ~55x speedup (7.723 s vs. 0.140 s).
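
    A lighter-weight way to confirm the difference is the standard timeit module; a sketch (the snippet is mine, not part of the original test, and numbers will vary by machine):

        import timeit

        # timeit accepts a callable directly; run each function 10,000 times.
        old = timeit.timeit(testFuncOld, number=10000)
        new = timeit.timeit(testFuncNew, number=10000)
        print(f"old: {old:.3f}s  new: {new:.3f}s  speedup: {old / new:.0f}x")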

  • 2020-12-04 09:52

    First, you're creating the stop word list for each string. Create it once; a set is ideal here, since membership tests are O(1).

        forbidden_words = set(stopwords.words('english'))
    

    Next, get rid of the [] inside join and use a generator expression instead:

        ' '.join([x for x in ['a', 'b', 'c']])
    

    becomes:

        ' '.join(x for x in ['a', 'b', 'c'])
    

    The next thing to deal with would be making .split() yield values lazily instead of returning a list; a compiled regex is a good replacement here, as sketched below. See this thread for why s.split() is actually fast.
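
    A sketch of that lazy splitting, using re.finditer to yield one word at a time (the helper name iter_words is mine):

        import re

        word_re = re.compile(r'\S+')

        def iter_words(s):
            # Yield words one by one instead of building the full list s.split() returns.
            for match in word_re.finditer(s):
                yield match.group(0)

        text = ' '.join(w for w in iter_words('hello bye the the hi')
                        if w not in forbidden_words)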

    Lastly, do such a job in parallel (removing stop words in 6m strings); that is a whole different topic, but a minimal sketch follows.
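
    A minimal sketch of the parallel step with the standard multiprocessing module, assuming the strings live in a list called texts (the names here are mine, not from the question):

        from multiprocessing import Pool
        from nltk.corpus import stopwords

        forbidden_words = set(stopwords.words('english'))

        def remove_stopwords(text):
            # Keep only the words that are not stop words.
            return ' '.join(w for w in text.split() if w not in forbidden_words)

        if __name__ == '__main__':
            texts = ['hello bye the the hi'] * 1000  # stand-in for the 6m strings
            with Pool() as pool:
                cleaned = pool.map(remove_stopwords, texts, chunksize=100)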

  • 2020-12-04 09:55

    Sorry for the late reply, but this may prove useful for new users.

    • Create a dictionary of stop words using the collections library
    • Use that dictionary for very fast lookups (O(1)) rather than searching a list (O(number of stop words))

        from collections import Counter
        from nltk.corpus import stopwords

        # A Counter is a dict subclass, so "in" tests are hashed O(1) lookups.
        stop_words = stopwords.words('english')
        stopwords_dict = Counter(stop_words)
        text = ' '.join([word for word in text.split() if word not in stopwords_dict])
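
    A plain set gives the same constant-time membership test with less machinery; a two-line alternative (my suggestion, not from the answer):

        stopwords_set = set(stopwords.words('english'))
        text = ' '.join([word for word in text.split() if word not in stopwords_set])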