问题
I'm trying to write a program that will read in a text file and output a list of most common words (30 as the code is written now) along with their counts. so something like:
word1 count1
word2 count2
word3 count3
... ...
... ...
wordn countn
in order of count1 > count2 > count3 >... >countn. This is what I have so far but I cannot get the sorted function to perform what I want. The error I get now is:
TypeError: list indices must be integers, not tuple
I'm new to python. Any help would be appreciated. Thank you.
def count_func(dictionary_list):
return dictionary_list[1]
def print_top(filename):
word_list = {}
with open(filename, 'r') as input_file:
count = 0
#best
for line in input_file:
for word in line.split():
word = word.lower()
if word not in word_list:
word_list[word] = 1
else:
word_list[word] += 1
#sorted_x = sorted(word_list.items(), key=operator.itemgetter(1))
# items = sorted(word_count.items(), key=get_count, reverse=True)
word_list = sorted(word_list.items(), key=lambda x: x[1])
for word in word_list:
if (count > 30):#19
break
print "%s: %s" % (word, word_list[word])
count += 1
# This basic command line argument parsing code is provided and
# calls the print_words() and print_top() functions which you must define.
def main():
if len(sys.argv) != 3:
print 'usage: ./wordcount.py {--count | --topcount} file'
sys.exit(1)
option = sys.argv[1]
filename = sys.argv[2]
if option == '--count':
print_words(filename)
elif option == '--topcount':
print_top(filename)
else:
print 'unknown option: ' + option
sys.exit(1)
if __name__ == '__main__':
main()
回答1:
Use the collections.Counter class.
from collections import Counter
for word, count in Counter(words).most_common(30):
print(word, count)
Some unsolicited advice: Don't make so many functions until everything is working as one big block of code. Refactor into functions after it works. You don't even need a main section for a script this small.
回答2:
Using itertools
' groupby
:
from itertools import groupby
words = sorted([w.lower() for w in open("/path/to/file").read().split()])
count = [[item[0], len(list(item[1]))] for item in groupby(words)]
count.sort(key=lambda x: x[1], reverse = True)
for item in count[:5]:
print(*item)
This will list the file's words, sort them and list unique words and their occurrence. Subsequently, the found list is sorted by occurrence by:
count.sort(key=lambda x: x[1], reverse = True)
The
reverse = True
is to list the most common words first.In the line:
for item in count[:5]:
[:5]
defines the number of most occurring words to show.
回答3:
First method as others have suggested i.e. by using most_common(...)
doesn't work according to your needs cause it returns the nth first most common words and not the words whose count is less than or equal to n
:
Here's using most_common(...)
: note it just print the first nth most common words:
>>> import re
... from collections import Counter
... def print_top(filename, max_count):
... words = re.findall(r'\w+', open(filename).read().lower())
... for word, count in Counter(words).most_common(max_count):
... print word, count
... print_top('n.sh', 1)
force 1
The correct way would be as follows, note it prints all the words whose count is less than equal to count:
>>> import re
... from collections import Counter
... def print_top(filename, max_count):
... words = re.findall(r'\w+', open(filename).read().lower())
... for word, count in filter(lambda x: x[1]<=max_count, sorted(Counter(words).items(), key=lambda x: x[1], reverse=True)):
... print word, count
... print_top('n.sh', 1)
force 1
in 1
done 1
mysql 1
yes 1
egrep 1
for 1
1 1
print 1
bin 1
do 1
awk 1
reinstall 1
bash 1
mythtv 1
selections 1
install 1
v 1
y 1
回答4:
Here is my python3 solution. I was asked this question in an interview and the interviewer was happy this solution, albeit in a less time-constrained situation the other solutions provided above seem a lot nicer to me.
dict_count = {}
lines = []
file = open("logdata.txt", "r")
for line in file:# open("logdata.txt", "r"):
lines.append(line.replace('\n', ''))
for line in lines:
if line not in dict_count:
dict_count[line] = 1
else:
num = dict_count[line]
dict_count[line] = (num + 1)
def greatest(words):
greatest = 0
string = ''
for key, val in words.items():
if val > greatest:
greatest = val
string = key
return [greatest, string]
most_common = []
def n_most_common_words(n, words):
while len(most_common) < n:
most_common.append(greatest(words))
del words[(greatest(words)[1])]
n_most_common_words(20, dict_count)
print(most_common)
来源:https://stackoverflow.com/questions/39299600/trying-to-output-the-x-most-common-words-in-a-text-file