Question
In the nltk book there is the question "Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?"
I thought I could use a function like state_union('1945-Truman.txt').count('men'). However, there are over 60 texts in this State of the Union corpus, and I feel like there has to be an easier way to see the count of these words for each one instead of repeating this function over and over for each text.
Answer 1:
You can use the .words() function of the corpus reader to return a list of strings (i.e. tokens/words):
>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
Then use a Counter() object to count the instances; see https://docs.python.org/2/library/collections.html#collections.Counter:
>>> wordcounts = Counter(brown.words())
But do note that the Counter is case-sensitive, see:
>>> from nltk.corpus import brown
>>> from collections import Counter
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> wordcounts = Counter(brown.words())
>>> wordcounts['the']
62713
>>> wordcounts['The']
7258
>>> wordcounts_lower = Counter(i.lower() for i in brown.words())
>>> wordcounts_lower['The']
0
>>> wordcounts_lower['the']
69971
Source: https://stackoverflow.com/questions/22762893/nltk-function-to-count-occurrences-of-certain-words