Question
This gives me the frequency of words in a text:
import re
import operator
from collections import defaultdict

fullWords = re.findall(r'\w+', allText)
d = defaultdict(int)
for word in fullWords:
    d[word] += 1
finalFreq = sorted(d.items(), key=operator.itemgetter(1), reverse=True)
self.response.out.write(finalFreq)
This also gives me useless words like "the", "an", and "a".
My question is: is there a stop-words library available in Python that can remove all these common words? I want to run this on Google App Engine.
Answer 1:
You can download lists of stopwords as files in various formats, e.g. from here -- all Python needs to do is read the file (and these are in csv format, easily read with the csv module), make a set, and use membership in that set (probably with some normalization, e.g. lowercasing) to exclude words from the count.
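A minimal sketch of that approach, assuming the downloaded list is a single-column CSV file named english_stopwords.csv (a hypothetical name, one word per row):

import csv
import re
from collections import defaultdict

def load_stopwords(path):
    # Read one stopword per row and lowercase it so membership tests
    # are case-insensitive.
    with open(path) as f:
        return set(row[0].strip().lower() for row in csv.reader(f) if row)

def word_frequencies(all_text, stopwords):
    counts = defaultdict(int)
    for word in re.findall(r'\w+', all_text):
        word = word.lower()           # normalize before the membership test
        if word not in stopwords:     # skip anything in the stopword set
            counts[word] += 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Usage (file name and text are placeholders):
# stopwords = load_stopwords('english_stopwords.csv')
# print(word_frequencies(allText, stopwords))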
Answer 2:
There's an easy way to handle this by slightly modifying the code you have (edited to reflect John's comment):
stopWords = set(['a', 'an', 'the', ...])  # ... stands for the rest of your stop-word list
fullWords = re.findall(r'\w+', allText)
d = defaultdict(int)
for word in fullWords:
    if word not in stopWords:
        d[word] += 1
finalFreq = sorted(d.items(), key=lambda t: t[1], reverse=True)
self.response.out.write(finalFreq)
This approach constructs the sorted list in two steps: first it filters out any words in your desired list of "stop words" (which has been converted to a set for efficiency), then it sorts the remaining entries.
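For reference, finalFreq ends up as a plain list of (word, count) tuples in descending order of count, e.g. something like [('quick', 2), ('dog', 2), ('fox', 1), ...] for a short sample text (the values here are only illustrative).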
Answer 3:
I know that NLTK ships a corpus package with stopwords for many languages, including English; see here for more information. NLTK also has a word frequency counter; it's a nice module for natural language processing that you should consider using.
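A minimal sketch along those lines, assuming the stopwords corpus has already been fetched with nltk.download('stopwords'):

import re
import nltk
from nltk.corpus import stopwords

# Run once per machine to fetch the corpus:
# nltk.download('stopwords')

english_stopwords = set(stopwords.words('english'))

# allText is the same variable as in the question.
words = [w.lower() for w in re.findall(r'\w+', allText)]
freq = nltk.FreqDist(w for w in words if w not in english_stopwords)

# FreqDist.most_common() returns (word, count) pairs, most frequent first.
print(freq.most_common(20))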
Answer 4:
stopwords = set(['an', 'a', 'the'])  # etc...
# The generator expression must be parenthesized when it is not the only argument.
finalFreq = sorted(((k, v) for k, v in d.items() if k not in stopwords),
                   key=operator.itemgetter(1), reverse=True)
This will filter out any keys which are in the stopwords set.
Source: https://stackoverflow.com/questions/3173592/word-frequency-in-text-using-python-but-disregard-stop-words