Counting the number of unique words in a document with Python

后端未结


I am a Python newbie trying to understand the answer given here to the question of counting unique words in a document. The answer is:

print len(set(w.lower()


                      
6 answers

感动是毒 (2020-12-02 00:36)

You can get the number of items in a set, list, or tuple in the same way: len(my_set) or len(my_list).

Edit: Counting the number of times each word is used is a different problem.

Here is the obvious approach:

count = {}
# Read the whole file and split it into whitespace-separated words.
with open('filename.dat') as f:
    words = f.read().split()
for w in words:
    if w in count:
        count[w] += 1
    else:
        count[w] = 1
for word, times in count.items():
    print("%s was found %d times" % (word, times))


If you want to avoid the if-clause, you can look at collections.defaultdict.
                                                                        
                                                        
            
            
              
                
野性不改 (2020-12-02 00:46)

I believe that Counter is all that you need in this case:

from collections import Counter

# yourtext is assumed to be a string holding the document's text
print(Counter(yourtext.split()))
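
As a rough sketch of how this relates to the original question: a Counter is a dict subclass, so its length is the number of distinct words, and most_common lists the most frequent ones. The sample string and the name text are made up for illustration, and Python 3 syntax is assumed.

```python
from collections import Counter

# Hypothetical sample text; in practice this would be the document's contents.
text = "the cat saw the other cat"

counts = Counter(text.split())

print(counts["the"])           # occurrences of one word: 2
print(len(counts))             # number of distinct words: 4
print(counts.most_common(2))   # the two most frequent words with their counts
```

Note that this splits on whitespace only and does not lowercase the words, so "The" and "the" would be counted separately unless the text is normalised first.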

                                                                        
                                                        
            
            
              
                
灰色年华 (2020-12-02 00:47)

A set, by definition, contains only unique elements (in your case, the same lowercased string cannot appear in it twice). So all you have to do is get the number of elements in the set, which is its length: len(set(...)).
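
The same idea can be spelled out step by step. This is only a sketch: it uses a made-up in-memory string (text) instead of a file, assumes Python 3, and treats words as whitespace-separated, so punctuation attached to a word would make it count as a different word.

```python
# Hypothetical sample text standing in for the document's contents.
text = "The cat saw the other Cat"

words = text.split()                    # all words, duplicates included
lowered = [w.lower() for w in words]    # normalise case
distinct = set(lowered)                 # duplicates collapse here

print(len(words))      # 6
print(len(distinct))   # 4
```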
                                                                        
                                                        
            
            
              
                
北恋 (2020-12-02 00:47)

Your question already contains the answer. If s is the set of unique words in the document, then len(s) gives the number of elements in the set, i.e. the number of unique words in the document.
                                                                        
                                                        
            
            
              
                
终归单人心 (2020-12-02 00:48)

You can use Counter:

from collections import Counter
c = Counter(['mama','papa','mama'])


The value of c will be:

Counter({'mama': 2, 'papa': 1})

                                                                        
                                                        
            
            
              
                
长发绾君心 (2020-12-02 00:53)

I would say that this code counts the number of distinct words, not the number of unique words, which would be the number of words that occur only once.

This counts the number of times that each word occurs:

from collections import defaultdict

word_counts = defaultdict(int)

# Lowercase each word so that "The" and "the" are counted together.
with open('filename.dat') as f:
    for w in f.read().split():
        word_counts[w.lower()] += 1

for w, c in word_counts.items():
    print(w, "occurs", c, "times")
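
Under that stricter reading, the number of unique words is the number of entries whose count is exactly 1. A minimal sketch, assuming Python 3, using a made-up string (text) in place of the file, and using Counter instead of defaultdict for brevity:

```python
from collections import Counter

# Hypothetical sample text standing in for the file's contents.
text = "The cat saw the other Cat"

word_counts = Counter(w.lower() for w in text.split())

distinct_words = len(word_counts)
# Words that occur exactly once in the text.
once_only = [w for w, c in word_counts.items() if c == 1]

print(distinct_words)    # 4
print(len(once_only))    # 2
```

Here "the" and "cat" each occur twice, so they are distinct but not unique; only "saw" and "other" occur once.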

                                                                        
                                                        
            
            
              
                