Question
At the moment I'm trying to process the lingspam dataset by counting the occurrence of words in 600 files (400 emails and 200 spam emails). I've already reduced each word to its stem with the Porter Stemmer algorithm, and I would also like my results to be standardized across each file for further processing, but I'm unsure how to accomplish this.
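For reference, the stemming step with NLTK's PorterStemmer looks roughly like this (assuming NLTK provides the stemmer; the question doesn't say which implementation was used):

from nltk.stem import PorterStemmer  # assumption: NLTK's Porter stemmer

stemmer = PorterStemmer()
for word in ["university", "universal", "universe", "sales"]:
    print word, "->", stemmer.stem(word)
# university -> univers
# universal  -> univers
# universe   -> univers
# sales      -> sale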
Resources thus far
- 8.3. collections — Container datatypes
- How to count co-ocurrences with collections.Counter() in python?
- Bag of Words model
In order to get the output below, I need to be able to add items that may not exist inside the file, listed in ascending order.
printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0, 'univers', 0, 'sales', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2, 'univers', 0, 'sales', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0, 'univers', 0, 'sales', 1)]
Which I then plan on converting into vectors using numpy (a minimal sketch of that conversion follows the example output below).
[0,0,0]
[2,0,0]
[0,0,1]
instead of..
printing from ./../lingspam_results/spmsgb165.txt.out
[]
printing from ./../lingspam_results/spmsgb166.txt.out
[('univers', 2)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('sale', 1)]
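As for the numpy conversion mentioned above, a minimal sketch of that step (assuming the per-file counts are already in the fixed ('money', 'univers', 'sales') order shown in the desired output):

import numpy as np

# hypothetical per-file counts in the fixed ('money', 'univers', 'sales') order
counts_per_file = [
    [0, 0, 0],  # spmsgb164.txt.out
    [2, 0, 0],  # spmsgb166.txt.out
    [0, 0, 1],  # spmsgb167.txt.out
]
vectors = np.array(counts_per_file)
print vectors.shape  # (3, 3): one row per file, one column per search term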
How can I standardize my results from the Counter module into ascending order, while also adding items from my search_list to the Counter result even when they don't appear in the file? I've tried something already below that simply reads from each text file and creates a list based on the search_list.
import numpy as np, os
from collections import Counter

def parse_bag(directory, search_list):
    words = []
    for (dirpath, dirnames, filenames) in os.walk(directory):
        for f in filenames:
            path = directory + "/" + f
            count_words(path, search_list)
    return

def count_words(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    wordfreq = Counter(filteredwords).most_common(5)
    print "printing from " + filename
    print wordfreq

search_list = ['sale', 'univers', 'money']
parse_bag("./../lingspam_results", search_list)
Thanks
Answer 1:
From your question, it sounds like your requirements are that you want the same words in a consistent ordering across all files, with counts. This should do it for you:
def count_words(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    counter = Counter(filteredwords)
    for w in search_list:
        counter[w] += 0  # ensure exists
    wordfreq = sorted(counter.items())
    print "printing from " + filename
    print wordfreq

search_list = ['sale', 'univers', 'money']
sample output:
printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0), ('sale', 1), ('univers', 0)]
I don't think you want to use most_common at all, since you specifically don't want the contents of each file to affect the ordering or the length of the list.
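If the end goal is a plain numeric vector per file rather than a list of tuples, the same idea can be taken one step further by indexing the Counter with the search terms in a fixed order (a sketch building on this answer, not part of it):

from collections import Counter

def count_vector(filename, search_list):
    textwords = open(filename, 'r').read().split()
    counter = Counter(t for t in textwords if t in search_list)
    # a Counter returns 0 for missing keys, so absent search terms come out as zeros
    return [counter[w] for w in sorted(search_list)]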
Answer 2:
The call Counter(filteredwords), as you use in your example, counts all the words just as you intend. To pick out the most frequent ones yourself, you can reprocess the items in the counter into a sequence of (frequency, word) tuples and sort that:
def most_common(counter, n=5):
    freq = sorted(((value, item) for item, value in counter.viewitems()), reverse=True)
    return [item[1] for item in freq[:n]]
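For example, applied to a small hand-built Counter (a hypothetical usage sketch, not from the original answer; note that viewitems() is Python 2, use items() on Python 3):

from collections import Counter

counter = Counter(['money', 'money', 'money', 'sale', 'sale', 'univers'])
print most_common(counter, n=2)
# ['money', 'sale']  -- the two most frequent words, highest count first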
Answer 3:
A combination of both jsbueno's and Mu Mind's answers:
def count_words_SO(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    counter = Counter(filteredwords)
    for w in search_list:
        counter[w] += 0  # ensure exists
    wordfreq = number_parse(counter)
    print "printing from " + filename
    print wordfreq

def number_parse(counter, n=5):
    freq = sorted(((value, item) for item, value in counter.viewitems()), reverse=True)
    return [item[0] for item in freq[:n]]
This comes out with the output below; just a little more work and I'll have it ready for a neural network. Thanks all :)
printing from ./../lingspam_results/spmsgb19.txt.out
[0, 0, 0]
printing from ./../lingspam_results/spmsgb2.txt.out
[4, 0, 0]
printing from ./../lingspam_results/spmsgb20.txt.out
[10, 0, 0]
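One way to collect those per-file lists into a single feature matrix for the neural-network step could look like the sketch below (it assumes count_words_SO is changed to return its list instead of printing it; the helper name collect_vectors is made up for illustration):

import numpy as np, os

def collect_vectors(directory, search_list):
    vectors = []
    for (dirpath, dirnames, filenames) in os.walk(directory):
        for f in sorted(filenames):
            # assumes count_words_SO returns the per-file count list instead of printing it
            vectors.append(count_words_SO(dirpath + "/" + f, search_list))
    return np.array(vectors)  # shape: (number of files, len(search_list))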
Source: https://stackoverflow.com/questions/12739318/how-can-i-add-items-to-collection-counter-and-then-sort-them-into-asc