Retrieving total number of words with 2 or more letters in a document using python

Submitted by 六月ゝ 毕业季﹏ on 2020-01-03 06:47:17

Question


I have a small Python script that calculates the top 10 most frequent words, the 10 most infrequent words and the total number of words in a .txt document. According to the assignment, a word is defined as 2 letters or more. The 10 most frequent and the 10 most infrequent words print fine. However, when I attempt to print the total number of words in the document, it prints the total number of all the words, including the single-letter words (such as "a"). How can I get the total to count ONLY the words that have 2 letters or more?

Here is my script:

from string import *
from collections import defaultdict
from operator import itemgetter
import re

number = 10
words = {}
total_words = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)

"""Define function to count the total number of words"""
def count_words(s):
    unique_words = split(s)
    return len(unique_words)

"""Define words as 2 letters or more -- no single letter words such as "a" """
for word in words:
    if len(word) >= 2:
        counter[word] += 1


"""Open text document, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')

for line in txt_file:
    total_words = total_words + count_words(line)
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1


# Most Frequent Words
top_words = sorted(counter.iteritems(),
                    key=lambda(word, count): (-count, word))[:number] 

print "Most Frequent Words: "

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (count, word))[:number]

print " "
print "Least Frequent Words: "

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)


# Total Unique Words:
print " "
print "Total Number of Words: %s" % total_words

I am not an expert with Python; this is for a Python class I am currently taking. The neatness and proper formatting of my code are graded in this assignment. If possible, can someone also tell me whether the format of this code is considered "good practice"?


Answer 1:


The list comprehension method:

def countWords(s):
    words = s.split()
    return len([word for word in words if len(word)>=2])

The verbose method:

def countWords(s):
    words = s.split()
    count = 0
    for word in words:
        if len(word) >= 2:
            count += 1
    return count

As an aside, kudos on using defaultdict, but I would go with collections.Counter:

import collections

# Count whitespace-separated words of two or more characters
words = collections.Counter(word for line in open(filepath)
                            for word in line.split() if len(word) >= 2)
mostFrequent = [w[0] for w in words.most_common(10)]
leastFrequent = [w[0] for w in words.most_common()[-10:]]
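Since the question is about the total, here is a minimal Python 3 sketch of how the Counter approach could also yield it: summing the per-word counts gives the number of qualifying words. The sample lines and the names sample_lines, counts and total_words are made up for illustration; they stand in for the real file.

```python
from collections import Counter

# Hypothetical sample text standing in for the lines of the .txt file.
sample_lines = [
    "a cat and a dog",
    "the cat saw I",
]

# Count only whitespace-separated tokens with at least two characters.
counts = Counter(
    word
    for line in sample_lines
    for word in line.split()
    if len(word) >= 2
)

# Summing the per-word counts gives the total number of qualifying words.
total_words = sum(counts.values())

# most_common(n) returns (word, count) pairs, most frequent first.
most_frequent = [word for word, _ in counts.most_common(3)]
```

Because the total is derived from the same Counter, single-letter words such as "a" and "I" are excluded from it automatically.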

Hope this helps




Answer 2:


count_words simply uses split().

You should use the words_only regular expression here too:

def count_words(s):
    unique_words = split(s)
    return len(filter(lambda x: words_only.match(x), unique_words))
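One caveat: the words_only pattern in the question only matches lowercase tokens with no punctuation, so for the total to agree with the per-word counts, count_words would probably need to normalise each token the same way the main loop does. A minimal Python 3 sketch under that assumption (it uses s.split() and an explicit loop rather than the Python 2 filter idiom):

```python
import re
from string import punctuation

# Same pattern as in the question: the whole token must be 2+ lowercase letters.
words_only = re.compile(r'^[a-z]{2,}$')

def count_words(s):
    """Return how many tokens in s are words of two or more letters."""
    total = 0
    for token in s.split():
        # Normalise as the question's main loop does before testing the pattern.
        word = token.strip(punctuation).lower()
        if words_only.match(word):
            total += 1
    return total
```

For example, count_words("A cat, I saw: The DOG!") counts "cat", "saw", "the" and "dog" but not "a" or "i".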

Your style looks great :)




Answer 3:


I'm sorry, but I seem to have gone a bit overboard with this solution. I mean I've really picked your code apart, and then put it back together the way I would do it:

from collections import defaultdict
from operator import itemgetter
from heapq import nlargest, nsmallest
from itertools import starmap
from textwrap import dedent
import re

class WordCounter(object):
    """
    Count the number of words consisting of two letters or more.
    """

    words_only = re.compile(r'[a-z]{2,}', re.IGNORECASE)

    def __init__(self, filename, number=10):
        self.counter = defaultdict(int)

        # Open text document and find all words
        with open(filename, 'r') as txt_file:
            for word in self.words_only.findall(txt_file.read()):
                self.counter[word.lower()] += 1

        # Get total count
        self.total_words = sum(self.counter.values())

        # Most Frequent Words
        self.top_words = nlargest(
            number, self.counter.items(), itemgetter(1))

        # Least Frequent Words
        self.least_words = nsmallest(
            number, self.counter.items(), itemgetter(1))

    def __str__(self):
        """
        Summary of least and most used words, and total word count.
        """
        template = dedent("""
            Most Frequent Words:
            {0}

            Least Frequent Words:
            {1}

            Total Number of Words: {2}
            """)

        line_template = "{0}: {1}".format
        top_words = "\n".join(starmap(line_template, self.top_words))
        least_words = "\n".join(starmap(line_template, self.least_words))

        return template.format(top_words, least_words, self.total_words)


print WordCounter("charactermask.txt")

Here's a summary of the changes I've made, and why

  • Don't do from x import *. Some modules are designed to let you do it safely, but in general it's a bad idea due to namespace pollution. Import just the things you need, or import the module with a shortened name: import string as st. This will result in less buggy code.

  • Make it a class. Although writing it as a script is fine for this sort of thing, it's a good habit to always wrap your code in classes or functions to better organize it, and for when you need it in another project. Then you can just do from wordcounter import WordCounter and you're good to go.

  • Docstrings moved inside the code block. This way they'll be used automatically if you type help(my_class_or_function) in the interactive interpreter.

  • Comments are usually prefixed with # instead of being throwaway strings. It's not a big no-no but a rather common convention.

  • Use the with statement when opening files. It's a good habit. You don't have to worry about remembering to close them.

  • .strip().split() is redundant. Use just .split().

  • Use re.findall. This avoids the problem of words like "top-notch", which won't be counted at all using your method. With findall we're counting "top" and "notch", as per the definition. Also, it's faster. But we have to change the regexp a bit.

  • The words dict is unused. Deleted.

  • Use sum to calculate total word count. This solves the problem in your and inspectorG4dget's code, where the words_only pattern really needs to be used two times for each word -- once for the total and once for the word count -- to get a consistent result.

  • Use heapq.nlargest and heapq.nsmallest. They're faster and more memory-efficient than a full sort when you only need the n smallest or largest results.

  • Make functions that return strings that you may or may not wish to print. Using print statements directly is less flexible, though very nice for debugging.

  • For new code, use format string method instead of the % operator. The former was made to improve upon and replace the latter.

  • Use multi-line strings instead of multiple consecutive prints. It's easier to see what will actually get written, and it's easier to maintain. The textwrap.dedent function helps if you want to indent the string to the same level as the surrounding code.
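A few of the points above (re.findall, sum for the total, heapq for the extremes) can be sketched together. This is a small Python 3 illustration on a made-up sample string, not the class from this answer; the text and the names top_two and bottom_two are assumptions for the example.

```python
import re
from collections import Counter
from heapq import nlargest, nsmallest
from operator import itemgetter

# Hypothetical sample text; not taken from the original file.
text = "A top-notch cat saw the cat. I saw a DOG, the cat ran."

# findall splits "top-notch" into "top" and "notch" and skips one-letter words.
words = re.findall(r'[a-z]{2,}', text.lower())
counter = Counter(words)

# The total comes from the same counts, so the two cannot disagree.
total_words = sum(counter.values())

# The n largest / n smallest entries by count, without a full sort.
top_two = nlargest(2, counter.items(), key=itemgetter(1))
bottom_two = nsmallest(2, counter.items(), key=itemgetter(1))
```

Here "top" and "notch" are counted separately, while "A", "I" and "a" never reach the counter, so they are left out of the total as well.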

Also there's the question of which is more readable: starmap(line_template, self.top_words) or [line_template(*x) for x in self.top_words]. Most people prefer list comprehensions, and I usually agree with them, but here I liked the brevity of the starmap method.
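To make the comparison concrete, here is a short Python 3 sketch showing that both spellings build the same string. The pairs list is a hypothetical stand-in for self.top_words.

```python
from itertools import starmap

# Bound method used as a two-argument formatter, as in the class above.
line_template = "{0}: {1}".format

# Hypothetical (word, count) pairs in the shape nlargest would return.
pairs = [("cat", 3), ("saw", 2)]

# starmap unpacks each pair into positional arguments for line_template.
with_starmap = "\n".join(starmap(line_template, pairs))

# The list comprehension does the same unpacking explicitly with *pair.
with_comprehension = "\n".join([line_template(*pair) for pair in pairs])
```

Both variables end up holding the two lines "cat: 3" and "saw: 2", so the choice is purely one of readability.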

All that being said, I concur with user1552512, your style looks great! Nice, readable code, well commented, very PEP 8-compliant. You'll go far. :)




Answer 4:


Personally, I think your code looks fine. I don't know if it's "standard" Python style, but it is easy to read. I'm pretty new to Python as well, but here is my answer.

I'm assuming that your count_words(s) function is what calculates the total number of words. The problem you are having is that by just calling split, you are only separating the words on whitespace, so every token is counted regardless of its length.

You only need to count the words with 2+ characters, so in that function write a loop that counts only the words with 2+ characters in the unique_words list.
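A rough Python 3 sketch of such a loop, next to the raw split count for comparison. Stripping surrounding punctuation before measuring the length is an assumption about what the assignment wants (otherwise a token like "a," would count as two characters); the sample line is made up.

```python
from string import punctuation

def count_words(s):
    """Count whitespace-separated tokens with two or more characters."""
    count = 0
    for token in s.split():
        # Assumption: punctuation around a token is not part of the word.
        word = token.strip(punctuation)
        if len(word) >= 2:
            count += 1
    return count

# Hypothetical sample line.
line = "I saw a cat, a dog and a bird."
raw_total = len(line.split())       # every token, single letters included
filtered_total = count_words(line)  # only tokens of two or more characters
```

The raw split count includes "I" and each "a", while the filtered count keeps only "saw", "cat", "dog", "and" and "bird".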



Source: https://stackoverflow.com/questions/12482844/retrieving-total-number-of-words-with-2-or-more-letters-in-a-document-using-pyth
