As others have already mentioned, nltk would be your best option if you want something stable and scalable; it's highly configurable. The downside is a fairly steep learning curve if you want to tweak the defaults.
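To give an idea of what that configurability looks like, nltk ships a RegexpTokenizer that you can point at your own pattern. A minimal sketch (not tuned for the use case below, and it leaves trailing punctuation attached):

from nltk.tokenize import RegexpTokenizer

# Treat any run of non-whitespace characters as a single token, so names
# like "vue-router" are not broken apart (the trailing dot stays attached).
tokenizer = RegexpTokenizer(r'\S+')
print(tokenizer.tokenize("This article is talking about vue-router."))
# ['This', 'article', 'is', 'talking', 'about', 'vue-router.']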
I once encountered a situation where I wanted a bag of words. The problem was that the articles concerned technologies with exotic names full of hyphens, underscores and the like, such as vue-router or _.js.
The default configuration of nltk's word_tokenize, for instance, splits vue-router into two separate words, vue and router. I'm not even talking about what it does to _.js.
So, for what it's worth, I ended up writing this little routine to tokenize all the words into a list, based on my own punctuation criteria.
import re

# Split on my own punctuation criteria: spaces, end-of-sentence and
# mid-sentence dots, commas, slashes, parentheses, quotes, '!', '?' and '+'.
punctuation_pattern = r' |\.$|\. |, |\/|\(|\)|\'|\"|\!|\?|\+'

text = "This article is talking about vue-router. And also _.js."
ltext = text.lower()
# re.split leaves empty strings between adjacent separators; drop them.
wtext = [w for w in re.split(punctuation_pattern, ltext) if w]
print(wtext)
# ['this', 'article', 'is', 'talking', 'about', 'vue-router', 'and', 'also', '_.js']
This routine can easily be combined with Patty3118's answer about collections.Counter, which would tell you, for instance, how many times _.js was mentioned in the article.
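For example, a minimal sketch reusing the token list printed above:

from collections import Counter

# Token list produced by the routine above
wtext = ['this', 'article', 'is', 'talking', 'about', 'vue-router', 'and', 'also', '_.js']

counts = Counter(wtext)
print(counts['_.js'])
# 1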