Faster way to remove stop words in Python

情歌与酒 · 2020-12-04 09:34

I am trying to remove stopwords from a string of text:

    from nltk.corpus import stopwords

    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split()
                     if word not in stopwords.words('english')])

I'm processing 6 million such strings, so speed is important.


        
4 Answers
  • 2020-12-04 09:49

    Use a regexp that matches the stop words and remove them:

        import re
        from nltk.corpus import stopwords

        # Match any stop word as a whole word, plus the whitespace after it.
        pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
        text = pattern.sub('', text)
    

    This will probably be way faster than looping yourself, especially for large input strings.

    If the last word in the text gets deleted by this, you may have trailing whitespace. I propose to handle this separately.
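
    One minimal way to handle that trailing whitespace, assuming the same pattern object as above, is to strip the result:

        # Remove leftover whitespace at either end after substitution.
        text = pattern.sub('', text).strip()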

  • 2020-12-04 09:52

    Try caching the stopwords list, as shown below; constructing it each time you call the function seems to be the bottleneck.

        from nltk.corpus import stopwords

        # Build the stop word list once, at module load.
        cachedStopWords = stopwords.words("english")

        def testFuncOld():
            # Rebuilds the stop word list on every call.
            text = 'hello bye the the hi'
            text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

        def testFuncNew():
            # Reuses the cached list.
            text = 'hello bye the the hi'
            text = ' '.join([word for word in text.split() if word not in cachedStopWords])

        if __name__ == "__main__":
            for i in range(10000):
                testFuncOld()
                testFuncNew()

    I ran this through the profiler with python -m cProfile -s cumulative words.py. The relevant lines are posted below.

        ncalls  cumtime  filename:lineno(function)
         10000    7.723  words.py:7(testFuncOld)
         10000    0.140  words.py:11(testFuncNew)

    So caching the stopwords list gives a ~55x speedup (7.723 s vs. 0.140 s).
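
    A lighter-weight way to confirm the difference is the standard timeit module; a sketch (the snippet is mine, not part of the original test, and numbers will vary by machine):

        import timeit

        # timeit accepts a callable directly; run each function 10,000 times.
        old = timeit.timeit(testFuncOld, number=10000)
        new = timeit.timeit(testFuncNew, number=10000)
        print(f"old: {old:.3f}s  new: {new:.3f}s  speedup: {old / new:.0f}x")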

  • 2020-12-04 09:52

    First, you're creating the stop word list for each string. Create it once; a set is ideal here, since membership tests are O(1).

        forbidden_words = set(stopwords.words('english'))
    

    Next, get rid of the [] inside join and use a generator expression instead:

        ' '.join([x for x in ['a', 'b', 'c']])
    

    becomes:

        ' '.join(x for x in ['a', 'b', 'c'])
    

    The next thing to deal with would be making .split() yield values lazily instead of returning a list; a compiled regex is a good replacement here, as sketched below. See this thread for why s.split() is actually fast.
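
    A sketch of that lazy splitting, using re.finditer to yield one word at a time (the helper name iter_words is mine):

        import re

        word_re = re.compile(r'\S+')

        def iter_words(s):
            # Yield words one by one instead of building the full list s.split() returns.
            for match in word_re.finditer(s):
                yield match.group(0)

        text = ' '.join(w for w in iter_words('hello bye the the hi')
                        if w not in forbidden_words)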

    Lastly, do such a job in parallel (removing stop words in 6m strings); that is a whole different topic, but a minimal sketch follows.
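
    A minimal sketch of the parallel step with the standard multiprocessing module, assuming the strings live in a list called texts (the names here are mine, not from the question):

        from multiprocessing import Pool
        from nltk.corpus import stopwords

        forbidden_words = set(stopwords.words('english'))

        def remove_stopwords(text):
            # Keep only the words that are not stop words.
            return ' '.join(w for w in text.split() if w not in forbidden_words)

        if __name__ == '__main__':
            texts = ['hello bye the the hi'] * 1000  # stand-in for the 6m strings
            with Pool() as pool:
                cleaned = pool.map(remove_stopwords, texts, chunksize=100)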

  • 2020-12-04 09:55

    Sorry for the late reply, but this may prove useful for new users.

    • Create a dictionary of stop words using the collections library
    • Use that dictionary for very fast lookups (O(1)) rather than searching a list (O(number of stop words))

        from collections import Counter
        from nltk.corpus import stopwords

        # A Counter is a dict subclass, so "in" tests are hashed O(1) lookups.
        stop_words = stopwords.words('english')
        stopwords_dict = Counter(stop_words)
        text = ' '.join([word for word in text.split() if word not in stopwords_dict])
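
    A plain set gives the same constant-time membership test with less machinery; a two-line alternative (my suggestion, not from the answer):

        stopwords_set = set(stopwords.words('english'))
        text = ' '.join([word for word in text.split() if word not in stopwords_set])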