Counting the number of unique words in a document with Python

一个人的身影 2020-12-02 00:05

I am a Python newbie trying to understand the answer given here to the question of counting unique words in a document. The answer is:

print len(set(w.lower()         


        
6 Answers
  • 2020-12-02 00:36

    You can get the number of items in a set, list, or tuple the same way: len(my_set) or len(my_list).

    Edit: Counting the number of times each word is used is a different problem.
    Here is the obvious approach:

    count = {}
    # Read the whole file and split it on whitespace.
    with open('filename.dat') as f:
        words = f.read().split()
    for w in words:
        if w in count:
            count[w] += 1
        else:
            count[w] = 1
    for word, times in count.items():
        print("%s was found %d times" % (word, times))
    

    If you want to avoid the if-clause, you can look at collections.defaultdict.

  • 2020-12-02 00:46

    I believe that Counter is all that you need in this case:

    from collections import Counter
    
    print(Counter(yourtext.split()))
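
As a rough sketch of how that Counter could answer both questions (how often each word occurs, and how many different words there are), something like the following should work. The sample string is made up for illustration and stands in for yourtext; splitting on whitespace means punctuation stays attached to words.

```python
from collections import Counter

# Hypothetical sample text standing in for `yourtext`.
yourtext = "the cat saw the other cat and the dog"

counts = Counter(yourtext.split())

# len() of a Counter is the number of distinct keys, i.e. distinct words.
distinct_words = len(counts)

# most_common(n) returns the n most frequent (word, count) pairs.
top_two = counts.most_common(2)

print(counts["the"])
print(distinct_words)
print(top_two)
```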
    
  • 2020-12-02 00:47

    A set, by definition, contains only unique elements (in your case, the same lower-cased string cannot appear in it twice). So all you have to do is count the elements in the set, which is the length of the set: len(set(...))
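
To make that concrete, here is a minimal sketch, assuming the text is already in a string and that words are separated by whitespace. The function name count_distinct_words and the sample sentence are made up for the example.

```python
def count_distinct_words(text):
    """Return how many different words `text` contains, ignoring case."""
    # The set keeps one copy of each lower-cased word; len() counts them.
    return len({w.lower() for w in text.split()})


# Hypothetical sample: "The"/"the" and "cat"/"Cat" collapse to one entry each.
sample = "The cat saw the other Cat"
print(count_distinct_words(sample))
```

Note that punctuation is not stripped here, so "cat" and "cat." would still count as two different words.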

  • 2020-12-02 00:47

    Your question already contains the answer. If s is the set of unique words in the document, then len(s) gives the number of elements in the set, i.e. the number of unique words in the document.

  • 2020-12-02 00:48

    You can use Counter:

    from collections import Counter
    c = Counter(['mama','papa','mama'])
    

    The value of c will be:

    Counter({'mama': 2, 'papa': 1})
    
  • 2020-12-02 00:53

    I would say that the code counts the number of distinct words, not the number of unique words. Strictly speaking, a unique word is one that occurs only once.

    This counts the number of times that each word occurs:

    from collections import defaultdict
    
    word_counts = defaultdict(int)
    
    with open('filename.dat') as f:
        for w in f.read().split():
            word_counts[w.lower()] += 1
    
    for w, c in word_counts.items():
        print("%s occurs %d times" % (w, c))
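
The difference between distinct words and words that occur only once can be sketched like this. It is only an illustration: the helper name distinct_and_once is made up, and it uses Counter rather than defaultdict purely for brevity.

```python
from collections import Counter


def distinct_and_once(text):
    """Return (number of distinct words, number of words occurring exactly once)."""
    counts = Counter(w.lower() for w in text.split())
    distinct = len(counts)
    # Words that are "unique" in the strict sense appear exactly one time.
    once = sum(1 for c in counts.values() if c == 1)
    return distinct, once


# Hypothetical sample: "a" appears twice, "b" and "c" once each.
print(distinct_and_once("a b A c"))
```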
    