Python program that finds the most frequent word in a .txt file; must print the word and its count

前端未结


As of right now, I have a function to replace the countChars function,

def countWords(lines):
  wordDict = {}
  for line in lines:
    wordList = lines.split


                      
6 answers

迷失自我, 2020-12-08 00:03

This program is actually a 4-liner, if you use the powerful tools at your disposal:

import collections
import re

with open(yourfile) as f:   # yourfile: path to your .txt file
    text = f.read()

words = re.compile(r"[\w']+", re.U).findall(text)   # re.U == re.UNICODE
counts = collections.Counter(words)                 # maps word -> count


The regular expression finds all words, regardless of the punctuation adjacent to them (but counting apostrophes as part of the word).

A Counter behaves almost like a dictionary, but it also lets you do things like counts.most_common(10) or add two counters together. See help(collections.Counter).
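To illustrate those two Counter features, here is a small sketch. The sample word lists are made up for the example and are not from the question.

```python
from collections import Counter

# Made-up sample data, only for illustration.
first = Counter(["spam", "eggs", "spam"])
second = Counter(["spam", "ham"])

# most_common(n) returns the n highest counts as (word, count) pairs.
top_word, top_count = first.most_common(1)[0]
print(top_word, top_count)

# Adding two counters sums the counts key by key.
combined = first + second
print(combined["spam"], combined["eggs"], combined["ham"])
```

most_common(1)[0] is the piece that answers the original question: it gives the most frequent word together with its count.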

I would also suggest that you not make functions printBy..., since only functions without side-effects are easy to reuse.

def countsSortedAlphabetically(counter, **kw):
    return sorted(counter.items(), **kw)

#def countsSortedNumerically(counter, **kw):
#    return sorted(counter.items(), key=lambda x:x[1], **kw)
#### use counter.most_common(n) instead

# `from pprint import pprint as pp` is also useful
def printByLine(tuples):
    print( '\n'.join(' '.join(map(str,t)) for t in tuples) )


Demo:

>>> from collections import Counter
>>> words = Counter(['test','is','a','test'])
>>> printByLine( countsSortedAlphabetically(words, reverse=True) )
test 2
is 1
a 1


Edit, to address Mateusz Konieczny's comment: I replaced [a-zA-Z'] with [\w']. According to the Python docs, the character class \w "Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched." It does not match an apostrophe, which is why the apostrophe is listed separately in the class. Note that \w includes _ and 0-9, so if you don't want those and you aren't working with Unicode, you can use [a-zA-Z']. If you are working with Unicode, you need something like a negated character class to subtract [0-9_] from \w.
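One possible way to do that subtraction, as a sketch rather than the only option: the negated class [^\W\d_] matches a character that is a word character but neither a decimal digit nor an underscore, which leaves essentially the letters. The pattern name and the sample text are made up for the example.

```python
import re

# [^\W\d_] reads as "not a non-word character, not a digit, not an
# underscore", i.e. a word character that is not 0-9 or _.
# The apostrophe is allowed as an alternative so "don't" stays whole.
LETTERS_AND_APOSTROPHES = re.compile(r"(?:[^\W\d_]|')+")

sample = "naïve café_2 don't 42"
words = LETTERS_AND_APOSTROPHES.findall(sample)
print(words)
```

With this pattern the underscore and the digits act as separators instead of being counted as part of a word.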
                                                                        
                                                        
            
            
              
                
悲哀的现实, 2020-12-08 00:06

from collections import Counter

words = ['red', 'green', 'black', 'pink', 'black', 'white', 'black',
         'eyes', 'white', 'black', 'orange', 'pink', 'pink', 'red', 'red',
         'white', 'orange', 'white', 'black', 'pink', 'green', 'green', 'pink',
         'green', 'pink', 'white', 'orange', 'orange', 'red']

counts = Counter(words)            # tally each word
top_four = counts.most_common(4)   # the four highest (word, count) pairs
print(top_four)

                                                                        
                                                        
            
            
              
                
太阳男子, 2020-12-08 00:13

You have a simple typo: words where you want word.


Edit: You appear to have edited the source. Please use copy and paste to get it right the first time.

Edit 2: Apparently you're not the only one prone to typos. The real problem is that you have lines where you want line. I apologize for accusing you of editing the source.
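For reference, a sketch of what the corrected function might look like. Only the first four lines come from the question, whose code is cut off after the split line, so the counting loop and the return value are assumptions about what was intended. Note that split also needs parentheses to actually be called.

```python
def countWords(lines):
    wordDict = {}
    for line in lines:
        # `line` (the loop variable), not `lines`, and split() must be called.
        wordList = line.split()
        # Assumed continuation: tally each word in the dictionary.
        for word in wordList:
            wordDict[word] = wordDict.get(word, 0) + 1
    return wordDict


print(countWords(["a b a", "b a"]))
```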
                                                                        
                                                        
            
            
              
                
迷失自我, 2020-12-08 00:16

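Import Counter from collections and define the function:

from collections import Counter

def most_count(data_set, n):
    # data_set is the text to analyze, as a single string
    split_it = data_set.split()   # split on whitespace
    b = Counter(split_it)         # tally each word
    return b.most_common(n)       # the n highest (word, count) pairs


Call the function with the text and the number of top words you want. In my case n=15:

most_count(data_set, 15)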

                                                                        
                                                        
            
            
              
                
醉话见心, 2020-12-08 00:18

Here is a possible solution, not as elegant as ninjagecko's, but still:

from collections import defaultdict

dicto = defaultdict(int)

with open('yourfile.txt') as f:
    for line in f:
        s_line = line.rstrip().split(',')  # assuming ',' is the delimiter
        for ele in s_line:
            dicto[ele] += 1

# dicto contains words as keys, word counts as values

for k, v in dicto.items():  # Python 3: items() replaces iteritems()
    print(k, v)
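Since the question asks for the single most frequent word, a dictionary of counts like this one can be reduced with max. A minimal sketch, assuming the same comma-delimited input; the function name count_fields and the sample lines are made up for the example.

```python
from collections import defaultdict


def count_fields(lines, delimiter=","):
    """Count delimiter-separated fields across an iterable of lines."""
    counts = defaultdict(int)
    for line in lines:
        for field in line.rstrip().split(delimiter):
            counts[field] += 1
    return counts


# Made-up sample input standing in for the lines of a file.
sample_lines = ["red,blue,red\n", "green,red\n"]
counts = count_fields(sample_lines)

# max with key=counts.get picks the key whose count is largest.
most_frequent = max(counts, key=counts.get)
print(most_frequent, counts[most_frequent])
```

Passing an open file object instead of sample_lines works the same way, because a file iterates line by line.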

                                                                        
                                                        
            
            
              
                
慢半拍i, 2020-12-08 00:19

If you need to count the words in a passage, it is better to use a regex.

Let's start with a simple example:

import re

my_string = "Wow! Is this true? Really!?!? This is crazy!"

words = re.findall(r'\w+', my_string) #This finds words in the document


Result:

>>> words
['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']


Note that "Is" and "is" are two different words. My guess is that you want to count them as the same word, so we can convert all the words to uppercase and then count them.

from collections import Counter

cap_words = [word.upper() for word in words] #converts all the words to uppercase

word_counts = Counter(cap_words) #counts how many times each word appears


Result:

>>> word_counts
Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})


Are you good up to here?

Now we need to do exactly the same thing as above, except that this time we read the text from a file.

import re
from collections import Counter

with open('your_file.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word.upper() for word in words]

word_counts = Counter(cap_words)
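To finish the task from the question, the most frequent word and its count can be read off the Counter with most_common(1). This is a sketch under the same assumptions as above (words are runs of \w characters, case is ignored). The function name most_frequent_word is made up here, the empty-file handling is my own choice, and the demo writes a temporary file only so the example runs on its own.

```python
import os
import re
import tempfile
from collections import Counter


def most_frequent_word(path):
    """Return (word, count) for the most frequent word in the file at path,
    or None if the file contains no words."""
    with open(path) as f:
        passage = f.read()
    words = [word.upper() for word in re.findall(r"\w+", passage)]
    if not words:
        return None
    return Counter(words).most_common(1)[0]


# Demo on a temporary file holding the example sentence from above.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Wow! Is this true? Really!?!? This is crazy!")
word, count = most_frequent_word(tmp.name)
os.remove(tmp.name)
print(word, count)
```

When several words share the highest count, as IS and THIS do here, most_common(1) returns only one of them.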

                                                                        
                                                        
            
            
              
                