How can I add items to collections.Counter and then sort them in ascending order?

梦想的初衷 submitted on 2019-12-11 11:18:43

Question


At the moment I'm trying to process the lingspam dataset by counting the occurrences of words in 600 files (400 emails and 200 spam emails). I've already normalized each word with the Porter Stemmer algorithm, and I would also like my results to be standardized across each file for further processing, but I'm unsure how to accomplish this.

Resources thus far

  • 8.3. collections — Container datatypes
  • How to count co-ocurrences with collections.Counter() in python?
  • Bag of Words model

To get the output below, I need to be able to add items that may not exist in the file, with the items in ascending order.

printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0, 'univers', 0, 'sale', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2, 'univers', 0, 'sale', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0, 'univers', 0, 'sale', 1)]

I then plan on converting these into vectors using numpy:

[0,0,0]
[2,0,0]
[0,0,1]

instead of:

printing from ./../lingspam_results/spmsgb165.txt.out
[]
printing from ./../lingspam_results/spmsgb166.txt.out
[('univers', 2)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('sale', 1)]

How can I standardize my results from Counter into ascending order, while also adding items from my search_list that may not appear in the Counter result? Below is what I've tried so far; it simply reads each text file and builds a list of counts based on the search_list.

import numpy as np, os
from collections import Counter

def parse_bag(directory, search_list):
    # Walk the directory tree and count words in every file found.
    for (dirpath, dirnames, filenames) in os.walk(directory):
        for f in filenames:
            # Join with dirpath so files in subdirectories resolve correctly.
            path = os.path.join(dirpath, f)
            count_words(path, search_list)

def count_words(filename, search_list):
    with open(filename, 'r') as infile:
        textwords = infile.read().split()
    filteredwords = [t for t in textwords if t in search_list]
    # most_common only returns words that actually occur in the file.
    wordfreq = Counter(filteredwords).most_common(5)
    print "printing from " + filename
    print wordfreq

search_list = ['sale', 'univers', 'money']
parse_bag("./../lingspam_results", search_list)

Thanks


Answer 1:


From your question, it sounds like you want the same words in a consistent order across all files, with counts. This should do it:

def count_words(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    counter = Counter(filteredwords)
    for w in search_list:
        counter[w] += 0        # ensure exists
    wordfreq = sorted(counter.items())
    print "printing from " + filename
    print wordfreq

search_list = ['sale', 'univers', 'money']

Sample output:

printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0), ('sale', 1), ('univers', 0)]

I don't think you want to use most_common at all since you specifically don't want the contents of each file to affect the ordering or list length.
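
The same idea can be sketched without mutating the counter, since indexing a Counter with a missing key returns 0. This is only an illustration written for Python 3 (the code above uses Python 2 print statements); the function name `count_search_words` and the sample text are made up for the example.

```python
from collections import Counter


def count_search_words(text, search_list):
    """Return (word, count) pairs for every word in search_list, sorted by word."""
    wanted = set(search_list)  # set membership is cheap to test per token
    counter = Counter(t for t in text.split() if t in wanted)
    # Indexing a Counter with a missing key returns 0 instead of raising KeyError,
    # so words absent from the text still appear with a zero count.
    return [(word, counter[word]) for word in sorted(wanted)]


print(count_search_words("money for sale money", ["sale", "univers", "money"]))
# [('money', 2), ('sale', 1), ('univers', 0)]
```

Because the output is driven by the sorted search words rather than by the file contents, every file yields a list of the same length in the same order.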




Answer 2:


The call Counter(filteredwords) in your example counts all the words, just as you intend. Counter also provides a most_common method, but if you want control over the ordering you can reprocess the items in the counter yourself: build a sequence of (frequency, word) tuples and sort it:

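def most_common(counter, n=5):
    # Sort (count, word) tuples so the highest counts come first.
    freq = sorted(((value, item) for item, value in counter.viewitems()), reverse=True)
    # Keep only the words of the top n entries.
    return [item[1] for item in freq[:n]]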
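
For comparison, here is a small Python 3 sketch that puts the built-in `Counter.most_common` next to a manual sort of (count, word) tuples. The sample words are made up, and `items()` stands in for Python 2's `viewitems()`.

```python
from collections import Counter

counter = Counter(["money", "money", "money", "sale", "sale", "univers"])

# Built-in: (word, count) pairs, highest count first.
builtin_top = counter.most_common(2)

# Manual equivalent: sort (count, word) tuples in descending order,
# then flip each tuple back to (word, count).
manual_top = [
    (word, count)
    for count, word in sorted(
        ((count, word) for word, count in counter.items()), reverse=True
    )[:2]
]

print(builtin_top)  # [('money', 3), ('sale', 2)]
print(manual_top)   # [('money', 3), ('sale', 2)]
```

The two agree here because all counts are distinct. When counts tie, the manual version breaks the tie by comparing the words, so the two approaches can order tied entries differently.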



Answer 3:


A combination of the answers from jsbueno and Mu Mind:

def count_words_SO(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    counter = Counter(filteredwords)
    for w in search_list:
        counter[w] += 0        # ensure exists
    wordfreq = number_parse(counter)
    print "printing from " + filename
    print wordfreq

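def number_parse(counter, n=5):
    # Sort (count, word) tuples so the highest counts come first.
    freq = sorted(((value, item) for item, value in counter.viewitems()), reverse=True)
    # Return only the counts of the top n entries, in descending order.
    return [item[0] for item in freq[:n]]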

This produces the output below. With just a little more work I'll have it ready for a neural network. Thanks, all :)

printing from ./../lingspam_results/spmsgb19.txt.out
[0, 0, 0]
printing from ./../lingspam_results/spmsgb2.txt.out
[4, 0, 0]
printing from ./../lingspam_results/spmsgb20.txt.out
[10, 0, 0]
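
One caveat: number_parse sorts by count, so a given position in the list may not refer to the same word from one file to the next. If each position needs to map to a fixed word (as feature vectors usually require), a minimal Python 3 sketch could index the counter in search_list order instead. The name `count_vector` and the sample documents are hypothetical stand-ins for the real files.

```python
from collections import Counter


def count_vector(text, search_list):
    """Return counts in the same order as search_list, zeros included."""
    counter = Counter(text.split())
    # A missing word yields 0, so every vector has len(search_list) entries.
    return [counter[word] for word in search_list]


search_list = ["money", "sale", "univers"]
documents = ["", "money money", "sale"]  # made-up stand-ins for file contents
rows = [count_vector(doc, search_list) for doc in documents]
print(rows)  # [[0, 0, 0], [2, 0, 0], [0, 1, 0]]
```

Since every row has the same length and column order, passing `rows` to `numpy.array` would give a 2-D array with one row per file and one column per search word.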


Source: https://stackoverflow.com/questions/12739318/how-can-i-add-items-to-collection-counter-and-then-sort-them-into-asc
