Python program that finds the most frequent word in a .txt file; must print the word and its count

伪装坚强ぢ 2020-12-07 23:29

As of right now, I have a function to replace the countChars function:

def countWords(lines):
  wordDict = {}
  for line in lines:
    wordList = lines.split         


        
6 Answers
  • 2020-12-08 00:03

    This program is actually a 4-liner, if you use the powerful tools at your disposal:

    import collections
    import re

    with open(yourfile) as f:   # yourfile: path to your .txt file
        text = f.read()
    
    words = re.compile(r"[\w']+", re.U).findall(text)   # re.U == re.UNICODE
    counts = collections.Counter(words)
    

    The regular expression will find all words, regardless of the punctuation adjacent to them (but counting apostrophes as part of the word).

    A Counter acts almost exactly like a dictionary, but you can also do things like counts.most_common(10), add counters together, etc. See help(Counter).
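
    As a rough illustration of those two Counter features, here is a minimal sketch. The word lists are made-up sample data, not taken from the question.

```python
from collections import Counter

# Hypothetical sample data, not from the original post.
first = Counter(["the", "cat", "the", "hat"])
second = Counter(["the", "dog"])

# most_common(n) returns the n highest counts as (word, count) pairs.
top_word, top_count = first.most_common(1)[0]
print(top_word, top_count)  # the 2

# Adding two counters sums the counts key by key.
combined = first + second
print(combined["the"])  # 3
```

    A key that was never counted simply reports 0 instead of raising KeyError, which is one of the ways a Counter differs from a plain dict.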

    I would also suggest that you not make functions printBy..., since only functions without side-effects are easy to reuse.

    def countsSortedAlphabetically(counter, **kw):
        return sorted(counter.items(), **kw)
    
    #def countsSortedNumerically(counter, **kw):
    #    return sorted(counter.items(), key=lambda x:x[1], **kw)
    #### use counter.most_common(n) instead
    
    # `from pprint import pprint as pp` is also useful
    def printByLine(tuples):
        print( '\n'.join(' '.join(map(str,t)) for t in tuples) )
    

    Demo:

    >>> from collections import Counter
    >>> words = Counter(['test','is','a','test'])
    >>> printByLine( countsSortedAlphabetically(words, reverse=True) )
    test 2
    is 1
    a 1
    

    Edit to address Mateusz Konieczny's comment: replaced [a-zA-Z'] with [\w']. According to the Python docs, the character class \w "Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched." It does not match an apostrophe, which is why the apostrophe is listed separately. However, \w also includes _ and 0-9, so if you don't want those and you aren't working with Unicode, you can use [a-zA-Z']; if you are working with Unicode, you'd need a negative assertion or something similar to subtract [0-9_] from the \w character class.
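
    One way to do that subtraction, offered as a sketch rather than as the approach this answer had in mind, is the negated class [^\W\d_]: "not a non-word character, not a digit, not an underscore", which leaves only letters. The name WORD_RE and the sample string are made up for illustration.

```python
import re

# [^\W\d_] is a double negative that keeps Unicode letters only;
# the alternation adds the apostrophe back in.
WORD_RE = re.compile(r"(?:[^\W\d_]|')+")

# Hypothetical sample text, not from the original post.
sample = "Don't count snake_case, 42, or caf\u00e92 as single words"
words = WORD_RE.findall(sample)
print(words)
```

    With this pattern, snake_case is split into two words at the underscore, 42 is dropped entirely, and the trailing digit is cut off the accented word.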

  • 2020-12-08 00:06
    from collections import Counter

    words = ['red', 'green', 'black', 'pink', 'black', 'white', 'black',
             'eyes', 'white', 'black', 'orange', 'pink', 'pink', 'red', 'red',
             'white', 'orange', 'white', 'black', 'pink', 'green', 'green', 'pink',
             'green', 'pink', 'white', 'orange', 'orange', 'red']

    counts = Counter(words)
    top_four = counts.most_common(4)
    print(top_four)
    
  • 2020-12-08 00:13

    You have a simple typo, words where you want word.

    Edit: You appear to have edited the source. Please use copy and paste to get it right the first time.

    Edit 2: Apparently you're not the only one prone to typos. The real problem is that you have lines where you want line. I apologize for accusing you of editing the source.
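
    For reference, a corrected version of the question's function might look like the sketch below. The question's code stops right after the split, so the counting loop, the return value, and the sample input are assumptions about what was intended.

```python
def countWords(lines):
    """Return a dict mapping each whitespace-separated word to its count."""
    wordDict = {}
    for line in lines:
        # Split the current line (not the whole `lines` collection),
        # and call split() with parentheses.
        wordList = line.split()
        for word in wordList:
            wordDict[word] = wordDict.get(word, 0) + 1
    return wordDict

# Hypothetical input standing in for the lines of the .txt file.
counts = countWords(["the cat sat", "on the mat"])
most_frequent = max(counts, key=counts.get)
print(most_frequent, counts[most_frequent])  # the 2
```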

  • 2020-12-08 00:16

    Import Counter from collections and define the function:

    from collections import Counter

    # data_set is assumed to be a string, defined elsewhere, that holds the text
    def most_count(n):
        split_it = data_set.split()
        b = Counter(split_it)
        return b.most_common(n)
    

    Call the function with the number of top words you want. In my case n=15:

    most_count(15)
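
    Because most_count reads data_set from the enclosing scope, it only works once that variable exists. A variant that takes the text as a parameter is easier to reuse; the sketch below assumes that signature change and uses a made-up sample string.

```python
from collections import Counter

def most_count(text, n):
    """Return the n most common whitespace-separated words in text."""
    return Counter(text.split()).most_common(n)

# Hypothetical sample string standing in for data_set.
data_set = "spam eggs spam ham spam eggs"
print(most_count(data_set, 2))  # [('spam', 3), ('eggs', 2)]
```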
    
  • 2020-12-08 00:18

    Here is a possible solution, not as elegant as ninjagecko's but still:

    from collections import defaultdict
    
    dicto = defaultdict(int)
    
    with open('yourfile.txt') as f:
        for line in f:
            s_line = line.rstrip().split(',')  # assuming ',' is the delimiter
            for ele in s_line:
                dicto[ele] += 1
    
    # dicto contains words as keys, word counts as values
    
    for k, v in dicto.items():
        print(k, v)
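
    The loop above prints every word. To get only the most frequent one, as the question asks, max() with the counts as the key is enough. This sketch uses an in-memory list of made-up comma-separated lines in place of the file.

```python
from collections import defaultdict

# Hypothetical comma-separated lines standing in for the file contents.
lines = ["red,green,red\n", "blue,red,green\n"]

dicto = defaultdict(int)
for line in lines:
    for ele in line.rstrip().split(","):
        dicto[ele] += 1

# max() with key=dicto.get picks the key whose count is largest.
top_word = max(dicto, key=dicto.get)
print(top_word, dicto[top_word])  # red 3
```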
    
  • 2020-12-08 00:19

    If you need to count the words in a passage, it is better to use a regex.

    Let's start with a simple example:

    import re
    
    my_string = "Wow! Is this true? Really!?!? This is crazy!"
    
    words = re.findall(r'\w+', my_string) #This finds words in the document
    

    Result:

    >>> words
    ['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']
    

    Note that "Is" and "is" are treated as two different words. My guess is that you want to count them as the same word, so we can just convert all the words to uppercase and then count them.

    from collections import Counter
    
    cap_words = [word.upper() for word in words]  # converts all the words to uppercase
    
    word_counts = Counter(cap_words)  # counts how many times each word appears
    

    Result:

    >>> word_counts
    Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})
    

    Are you good up to here?

    Now we need to do exactly the same thing as above, except this time we read the text from a file.

    import re
    from collections import Counter
    
    with open('your_file.txt') as f:
        passage = f.read()
    
    words = re.findall(r'\w+', passage)
    
    cap_words = [word.upper() for word in words]
    
    word_counts = Counter(cap_words)
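
    The question also asks to print the most frequent word and its count. A minimal sketch of that last step, wrapped in a function, could look like this. The function name most_frequent_word and the UTF-8 encoding are assumptions, not part of the original answer.

```python
import re
from collections import Counter

def most_frequent_word(path):
    """Return (word, count) for the most frequent word, or None if there are no words."""
    with open(path, encoding="utf-8") as f:  # UTF-8 is an assumption
        passage = f.read()
    # Uppercase every word so "Is" and "is" are counted together.
    words = [word.upper() for word in re.findall(r"\w+", passage)]
    top = Counter(words).most_common(1)
    return top[0] if top else None
```

    Calling most_frequent_word('your_file.txt') and printing the returned pair gives the word and its count. When several words tie for first place, only one of them is returned.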
    