I am a Python newbie trying to understand the answer given here to the question of counting unique words in a document. The answer is:
print len(set(w.lower()
You can calculate the number of items in a set, list or tuple in the same way, with len(my_set) or len(my_list).
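As a small illustration of that point, here is a sketch with made-up sample data (the word list and variable contents are assumptions for the example, not taken from the question). It shows len working identically on all three container types, with only the set dropping duplicates:

```python
# Hypothetical sample data, not from the original question.
words = ["The", "cat", "saw", "the", "other", "cat"]

my_list = [w.lower() for w in words]   # keeps duplicates
my_tuple = tuple(my_list)              # same items, immutable
my_set = set(my_list)                  # duplicates collapse into one entry

print(len(my_list))   # 6
print(len(my_tuple))  # 6
print(len(my_set))    # 4
```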
Edit: Calculating the number of times each word is used is a different problem.
Here is the obvious approach:
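count = {}
# Read the whole file and split it into whitespace-separated words
with open('filename.dat') as f:
    for w in f.read().split():
        if w in count:
            count[w] += 1
        else:
            count[w] = 1
# Report how often each word occurred
for word, times in count.items():
    print("%s was found %d times" % (word, times))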
If you want to avoid the if-clause, you can look at collections.defaultdict.
I believe that Counter is all that you need in this case:
from collections import Counter
print(Counter(yourtext.split()))
A set, by definition, contains only unique elements (in your case, the same lower-cased string cannot appear in it twice). So all you have to do is count the elements in the set, which is the length of the set: len(set(...))
Your question already contains the answer. If s is the set of unique words in the document, then len(s) gives the number of elements in the set, i.e. the number of unique words in the document.
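A minimal sketch of that idea, assuming words are separated by whitespace and that case should be ignored. The function name count_distinct_words and the sample string are made up for illustration, and the snippet quoted in the question may differ in its details:

```python
def count_distinct_words(text):
    """Return the number of distinct whitespace-separated words, ignoring case."""
    # Build the set s of lower-cased words; duplicates are dropped automatically.
    s = set(w.lower() for w in text.split())
    return len(s)

print(count_distinct_words("Mama papa MAMA"))  # 2
```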
You can use Counter:
from collections import Counter
c = Counter(['mama','papa','mama'])
The value of c will be:
Counter({'mama': 2, 'papa': 1})
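For reference, a short sketch of how such a Counter can then be queried, reusing the same sample list. The lookups shown are standard Counter behaviour; 'dada' is just an arbitrary word that is not in the list:

```python
from collections import Counter

c = Counter(['mama', 'papa', 'mama'])

print(c['mama'])          # 2
print(c['dada'])          # 0: a missing word counts as zero, no KeyError
print(c.most_common(1))   # [('mama', 2)]
print(len(c))             # 2 distinct words
```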
I would say that this code counts the number of distinct words, not the number of unique words; unique words are those that occur only once.
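To make the distinction concrete, here is a rough sketch that computes both numbers, assuming whitespace splitting and case-insensitive comparison. The function name distinct_and_once and the sample sentence are invented for this example:

```python
from collections import Counter

def distinct_and_once(text):
    """Return (number of distinct words, number of words occurring exactly once)."""
    counts = Counter(w.lower() for w in text.split())
    distinct = len(counts)                              # every different word
    once = sum(1 for n in counts.values() if n == 1)    # words seen only once
    return distinct, once

sample = "the cat saw the other cat"
print(distinct_and_once(sample))  # (4, 2)
```

Here "the" and "cat" each appear twice, so there are four distinct words but only two ("saw" and "other") that occur once.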
This counts the number of times that each word occurs:
from collections import defaultdict
# Missing keys start at int(), i.e. 0, so no if-clause is needed
word_counts = defaultdict(int)
with open('filename.dat') as f:
    for w in f.read().split():
        word_counts[w.lower()] += 1
for w, c in word_counts.items():
    print("%s occurs %d times" % (w, c))