I am trying to remove stopwords from a string of text:
from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in
Use a regexp to remove every word that matches a stopword:
import re
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)
This will probably be way faster than looping yourself, especially for large input strings.
If the last word in the text gets deleted by this, you may be left with trailing whitespace. I suggest handling that separately, for example with text.rstrip().
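A minimal sketch of the regexp approach with the whitespace cleanup included. The short stop_words list is a hypothetical stand-in for stopwords.words('english') so the snippet runs without the NLTK data, and strip_stopwords_regex is an illustrative name, not something from the answer above. Escaping each word with re.escape is an extra precaution in case a stopword contains regex metacharacters.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); swap in the real list.
stop_words = ["the", "a", "an", "is", "hi"]

# Compile once. re.escape protects against metacharacters in any stopword.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b\s*")

def strip_stopwords_regex(text):
    # Remove each stopword plus the whitespace after it, then trim
    # whatever whitespace is left at the edges of the string.
    return pattern.sub("", text).strip()

print(strip_stopwords_regex("hello bye the the hi"))
```

Compiling the pattern once at module level matters for the same reason caching does: rebuilding it per call would repeat the expensive part.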
Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.
from nltk.corpus import stopwords
cachedStopWords = stopwords.words("english")
def testFuncOld():
    # Rebuilds the stopword list on every call.
    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])
def testFuncNew():
    # Reuses the list built once at import time.
    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split() if word not in cachedStopWords])
if __name__ == "__main__":
    for i in range(10000):  # xrange on Python 2
        testFuncOld()
        testFuncNew()
I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.
nCalls Cumulative Time
10000 7.723 words.py:7(testFuncOld)
10000 0.140 words.py:11(testFuncNew)
So, caching the stopwords list gives roughly a 55x speedup (7.723 s vs. 0.140 s cumulative).
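If a module-level global is undesirable, the same caching idea can be sketched with functools.lru_cache. Everything here is illustrative: get_stopwords is a hypothetical loader whose body stands in for the stopwords.words("english") call, and load_count exists only to show that the expensive step runs once.

```python
from functools import lru_cache

load_count = 0  # how many times the "expensive" load actually ran

@lru_cache(maxsize=None)
def get_stopwords():
    # Hypothetical stand-in for stopwords.words("english"); the real call
    # would go here. The decorator ensures the body executes only once.
    global load_count
    load_count += 1
    return frozenset(["the", "a", "an", "is", "hi"])

def remove_stopwords_cached(text):
    cached = get_stopwords()  # cheap after the first call
    return " ".join(word for word in text.split() if word not in cached)

for _ in range(10000):
    remove_stopwords_cached("hello bye the the hi")

print(load_count)
```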
First, you're building the stop word list again for each string. Build it once. A set is a good fit here:
forbidden_words = set(stopwords.words('english'))
Next, get rid of the [] inside join and use a generator expression instead. Replace
' '.join([x for x in ['a', 'b', 'c']])
with
' '.join(x for x in ['a', 'b', 'c'])
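The first two steps combined might look like the sketch below. The forbidden_words set uses a few hypothetical sample words in place of set(stopwords.words('english')), and filter_stopwords is an illustrative name.

```python
# Hypothetical stand-in for set(stopwords.words('english')); built once.
forbidden_words = frozenset(["the", "a", "an", "is", "hi"])

def filter_stopwords(text, forbidden=forbidden_words):
    # Generator expression inside join; set membership is O(1) on average.
    return " ".join(word for word in text.split() if word not in forbidden)

print(filter_stopwords("hello bye the the hi"))
```

Whether the generator form beats the list comprehension is worth measuring on your own data, since the bigger win here is the set built once outside the function.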
The next thing to deal with would be to make .split() yield values instead of returning a list. I believe a regex would be a good replacement here. See this thread for why s.split() is actually fast.
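One way to get lazily yielded words is re.finditer, sketched here under the assumption that words are simply runs of non-whitespace characters (the same tokens str.split() with no arguments produces). iter_words is an illustrative name.

```python
import re

_token = re.compile(r"\S+")  # a run of non-whitespace characters

def iter_words(text):
    # Yield one word at a time instead of materialising the whole list.
    for match in _token.finditer(text):
        yield match.group()

print(list(iter_words("hello  bye\tthe the hi")))
```

This saves memory on very long strings, but it is not necessarily faster than s.split(), so it is worth timing both before switching.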
Lastly, run a job like this (removing stop words from 6 million strings) in parallel. That is a whole different topic.
Sorry for late reply. Would prove useful for new users.
Put the stopwords in a dictionary (here a Counter) for very fast lookup (O(1) on average) rather than searching a list (O(n) in the number of stopwords):
from collections import Counter
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)
text = ' '.join([word for word in text.split() if word not in stopwords_dict])
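A rough way to check the lookup claim is to time the same filter against a list, a set, and a Counter. This is only a sketch: stop_list is a short hypothetical stand-in for stopwords.words('english'), so the gap will be smaller than with the full list, and filter_with is an illustrative helper. A plain set offers the same average O(1) membership test as the Counter.

```python
import timeit
from collections import Counter

# Hypothetical stand-in for stopwords.words('english').
stop_list = ["the", "a", "an", "is", "hi"]
stop_set = set(stop_list)
stop_counter = Counter(stop_list)

def filter_with(container, text="hello bye the the hi"):
    # Same filter each time; only the container type changes.
    return " ".join(word for word in text.split() if word not in container)

for name, container in [("list", stop_list), ("set", stop_set), ("Counter", stop_counter)]:
    elapsed = timeit.timeit(lambda: filter_with(container), number=10000)
    print(f"{name:8s} {elapsed:.4f} s")
```

The timings depend on the machine and on the size of the stopword list, so no expected numbers are given here.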