Question
I want to exclude 'The', 'They' and 'My' from being displayed in my word cloud. I'm using the python library 'wordcloud' as below, and updating the STOPWORDS list with these 3 additional stopwords, but the wordcloud is still including them. What do I need to change so that these 3 words are excluded?
The libraries I imported are:
import numpy as np
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
I've tried adding elements to the STOPWORDS set as follows but, even though the words are added successfully, the wordcloud still shows the 3 words I added to the STOPWORDS set:
len(STOPWORDS)
Outputs: 192
Then I ran:
STOPWORDS.add('The')
STOPWORDS.add('They')
STOPWORDS.add('My')
Then I ran:
len(STOPWORDS)
Outputs: 195
I'm running python version 3.7.3
I know I could amend the text input to remove the 3 words (rather than trying to amend WordCloud's STOPWORDS set) before running the wordcloud but I was wondering if there's a bug with WordCloud or whether I'm not updating/using STOPWORDS correctly?
Answer 1:
The default for a WordCloud is collocations=True, so frequent phrases of two adjacent words are included in the cloud. Importantly for your issue, stopword removal works differently with collocations: for example, “Thank you” is a valid collocation and may appear in the generated cloud even though “you” is in the default stopwords. Only collocations that consist entirely of stopwords are removed.
The not unreasonable-sounding rationale for this is that if stopwords were removed before building the list of collocations then e.g. “thank you very much” would provide “thank very” as a collocation, which I definitely wouldn’t want.
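To make that rationale concrete, here is a small pure-Python sketch of the ordering problem. It is not WordCloud's own code: the bigrams helper and the one-word toy_stopwords set are made up for this illustration.

```python
def bigrams(tokens):
    """Return adjacent word pairs as space-joined strings."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Toy stopword set, chosen only for this illustration.
toy_stopwords = {"you"}

tokens = "thank you very much".split()

# Pair up neighbours first, while the original word order is intact.
pairs_before_removal = bigrams(tokens)

# Strip stopwords first, then pair up whatever is left.
remaining = [t for t in tokens if t not in toy_stopwords]
pairs_after_removal = bigrams(remaining)

print(pairs_before_removal)  # ['thank you', 'you very', 'very much']
print(pairs_after_removal)   # ['thank very', 'very much']
```

Removing stopwords first glues "thank" and "very" together, even though they were never adjacent in the text.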
So to get your stopwords to work perhaps how you expect, i.e. no stopwords at all appear in the cloud, you could use collocations=False like this:
my_wordcloud = WordCloud(
    stopwords=my_stopwords,
    background_color='white',
    collocations=False,
    max_words=10).generate(all_tweets_as_one_string)
UPDATE:
- With collocations False, stopwords are all lowercased for comparison with lowercased text when removing them, so there is no need to add 'The' etc.
- With collocations True (the default), stopwords are still lowercased, but when looking for all-stopword collocations to remove, the source text isn't lowercased. So e.g. 'The' in the text isn't removed while 'the' is removed. That's why @Balaji Ambresh's code works, and you'll see that there are no caps in the cloud. This might be a defect in Wordcloud, not sure. However, adding e.g. 'The' to stopwords won't affect this, because stopwords is always lowercased regardless of collocations True/False.
This is all visible in the source code :-)
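The two bullet points above can be mimicked with a toy model in plain Python. This is only a sketch of the behaviour as described here, not the library's implementation: lowercase_all and the three-word toy_stopwords set are hypothetical names used for illustration.

```python
def lowercase_all(stopwords):
    """Lowercase every stopword, as the answer says WordCloud does."""
    return {word.lower() for word in stopwords}


# Toy stopword set, chosen only for this illustration.
toy_stopwords = {"the", "they", "my"}
with_capitals = toy_stopwords | {"The", "They", "My"}

# Adding capitalised variants changes nothing once everything is lowercased.
same_after_lowercasing = lowercase_all(with_capitals) == lowercase_all(toy_stopwords)

normalized = lowercase_all(with_capitals)
tokens = ["The", "bear", "sat", "with", "the", "cat"]

# Comparing raw tokens: the capitalised 'The' slips through.
kept_if_text_not_lowercased = [t for t in tokens if t not in normalized]

# Comparing lowercased tokens: both 'The' and 'the' are dropped.
kept_if_text_lowercased = [t for t in tokens if t.lower() not in normalized]

print(same_after_lowercasing)       # True
print(kept_if_text_not_lowercased)  # ['The', 'bear', 'sat', 'with', 'cat']
print(kept_if_text_lowercased)      # ['bear', 'sat', 'with', 'cat']
```

In this model, the fix has to happen on the text side (lowercasing the input, or turning collocations off), because the stopword side is already normalised.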
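For example, with the default collocations=True I get:
[Image: word cloud generated with collocations=True]
And with collocations=False I get:
[Image: word cloud generated with collocations=False]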
Code:
from wordcloud import WordCloud
from matplotlib import pyplot as plt

text = "The bear sat with the cat. They were good friends. " + \
    "My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat " + \
    "there enjoying the view. You should have seen it. The view was absolutely lovely. " + \
    "It was such a lovely day. The bear was loving it too."

# collocations=False makes stopword removal case-insensitive for single words
cloud = WordCloud(collocations=False,
                  background_color='white',
                  max_words=10).generate(text)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Answer 2:
pip install nltk
Don't forget to download the stopwords corpus as well.
python
>>> import nltk
>>> nltk.download('stopwords')
Give this a shot:
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from nltk.corpus import stopwords

# Use a separate name so the imported nltk stopwords module is not shadowed
stop_words = set(stopwords.words('english'))
text = "The bear sat with the cat. They were good friends. " + \
    "My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat " + \
    "there enjoying the view. You should have seen it. The view was absolutely lovely. " + \
    "It was such a lovely day. The bear was loving it too."

# Lowercasing the text lets the lowercase stopwords match every occurrence
cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  max_words=10).generate(text.lower())
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Source: https://stackoverflow.com/questions/61953788/why-are-stop-words-not-being-excluded-from-the-word-cloud-when-using-pythons-wo