Efficiently count word frequencies in python

走了就别回头了 2020-11-29 04:33

I'd like to count frequencies of all words in a text file.

>>> countInFile('test.txt')

should return {'aaa':1, 'bbb':

8 Answers
  • 2020-11-29 04:54

    The most succinct approach is to use the tools Python gives you.

    from future_builtins import map  # Only on Python 2
    
    from collections import Counter
    from itertools import chain
    
    def countInFile(filename):
        with open(filename) as f:
            return Counter(chain.from_iterable(map(str.split, f)))
    

    That's it. map(str.split, f) is making a generator that returns lists of words from each line. Wrapping in chain.from_iterable converts that to a single generator that produces a word at a time. Counter takes an input iterable and counts all unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts, and during creation, you only store a line of data at a time and the total counts, not the whole file at once.

    In theory, on Python 2.7 and 3.1, you might do slightly better looping over the chained results yourself and using a dict or collections.defaultdict(int) to count (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher Counter has a C level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.
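
    To make the pipeline concrete, here is a tiny illustration of the same chain on an in-memory "file" (a list of lines standing in for the open file object):

    from collections import Counter
    from itertools import chain
    
    lines = ["aaa bbb ccc", "bbb aaa", "aaa"]
    print(Counter(chain.from_iterable(map(str.split, lines))))
    # Counter({'aaa': 3, 'bbb': 2, 'ccc': 1})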

    Update: You seem to want punctuation stripped and case-insensitivity, so here's a variant of my earlier code that does that:

    from string import punctuation
    from collections import Counter
    from itertools import chain
    
    def countInFile(filename):
        with open(filename) as f:
            # str.translate(None, punctuation) is the Python 2 form; on Python 3,
            # use line.translate(str.maketrans('', '', punctuation)) instead
            linewords = (line.translate(None, punctuation).lower().split() for line in f)
            return Counter(chain.from_iterable(linewords))
    

    Your code runs much more slowly because it's creating and destroying many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would be at least algorithmically similar in scaling factor).
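
    For reference, a minimal sketch of that per-line .update variant (slower than feeding Counter one chained iterable, but similar in how it scales):

    from collections import Counter
    
    def countInFile(filename):
        counts = Counter()
        with open(filename) as f:
            for line in f:
                counts.update(line.split())
        return counts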

  • 2020-11-29 04:57

    Combining everyone else's views and some of my own :) Here is what I have for you:

    from collections import Counter
    from nltk.tokenize import word_tokenize, RegexpTokenizer
    from nltk.corpus import stopwords
    
    text='''Note that if you use RegexpTokenizer option, you lose 
    natural language features special to word_tokenize 
    like splitting apart contractions. You can naively 
    split on the regex \w+ without any need for the NLTK.
    '''
    
    # tokenize
    raw = ' '.join(word_tokenize(text.lower()))
    
    tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
    words = tokenizer.tokenize(raw)
    
    # remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    
    # count word frequency, sort and return just 20
    counter = Counter()
    counter.update(words)
    most_common = counter.most_common(20)
    most_common
    

    Output

    (All ones)

    [('note', 1),
     ('use', 1),
     ('regexptokenizer', 1),
     ('option', 1),
     ('lose', 1),
     ('natural', 1),
     ('language', 1),
     ('features', 1),
     ('special', 1),
     ('word', 1),
     ('tokenize', 1),
     ('like', 1),
     ('splitting', 1),
     ('apart', 1),
     ('contractions', 1),
     ('naively', 1),
     ('split', 1),
     ('regex', 1),
     ('without', 1),
     ('need', 1)]
    

    One can do better than this in terms of efficiency, but if you are not too worried about that, this code is good enough.
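
    If you have not used NLTK before, the tokenizer models and the stopword list may need a one-time download first (a setup note, assuming a standard NLTK install):

    import nltk
    nltk.download('punkt')      # models used by word_tokenize
    nltk.download('stopwords')  # the stopword corpus used above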

  • 2020-11-29 04:59

    Skip CountVectorizer and scikit-learn.

    The file may be too large to load into memory, but I doubt the Python dictionary will get too large. The easiest option may be to split the large file into 10-20 smaller files and extend your code to loop over them, as in the sketch below.
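
    A minimal sketch of that chunked approach, using collections.Counter (the chunk filenames below are hypothetical; adapt them to however you split the file):

    from collections import Counter
    
    def count_in_files(filenames):
        counts = Counter()
        for filename in filenames:
            with open(filename) as f:
                for line in f:
                    counts.update(line.split())
        return counts
    
    # counts = count_in_files(['chunk_01.txt', 'chunk_02.txt', ...])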

  • 2020-11-29 05:04

    Instead of decoding the whole byte string read from the url, I process the binary data directly. Because bytes.translate expects its second argument to be a byte string, I utf-8 encode punctuation. After removing punctuation, I utf-8 decode the byte string.

    The function freq_dist expects an iterable. That's why I've passed data.splitlines().

    from urllib2 import urlopen
    from collections import Counter
    from string import punctuation
    from time import time
    import sys
    from pprint import pprint
    
    url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
    
    data = urlopen(url).read()
    
    def freq_dist(data):
        """
        :param data: file-like object opened in binary mode or
                     sequence of byte strings separated by '\n'
        :type data: an iterable sequence
        """
        #For readability   
        #return Counter(word for line in data
        #    for word in line.translate(
        #    None,bytes(punctuation.encode('utf-8'))).decode('utf-8').split())
    
        punc = punctuation.encode('utf-8')
        words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
        return Counter(words)
    
    
    start = time()
    word_dist = freq_dist(data.splitlines())
    print('elapsed: {}'.format(time() - start))
    pprint(word_dist.most_common(10))
    

    Output:

    elapsed: 0.806480884552
    
    [(u'de', 11106),
     (u'a', 6742),
     (u'que', 5701),
     (u'la', 4319),
     (u'je', 4260),
     (u'se', 3938),
     (u'\u043d\u0430', 3929),
     (u'na', 3623),
     (u'da', 3534),
     (u'i', 3487)]
    

    It seems a plain dict is more efficient than a Counter object.

    def freq_dist(data):
        """
        :param data: file-like object opened in binary mode or
                     sequence of byte strings separated by '\n'
        :type data: an iterable sequence
        """
        d = {}
        punc = punctuation.encode('utf-8')
        words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
        for word in words:
            d[word] = d.get(word, 0) + 1
        return d
    
    start = time()
    word_dist = freq_dist(data.splitlines())
    print('elapsed: {}'.format(time() - start))
    pprint(sorted(word_dist.items(), key=lambda x: (x[1], x[0]), reverse=True)[:10])
    

    Output:

    elapsed: 0.642680168152
    
    [(u'de', 11106),
     (u'a', 6742),
     (u'que', 5701),
     (u'la', 4319),
     (u'je', 4260),
     (u'se', 3938),
     (u'\u043d\u0430', 3929),
     (u'na', 3623),
     (u'da', 3534),
     (u'i', 3487)]
    

    To be more memory efficient when processing a huge file, pass just the opened url object instead of reading it all first. But then the timing will also include the file download time.

    data = urlopen(url)
    word_dist = freq_dist(data)
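
    For reference, a rough Python 3 adaptation of the same idea might look like this (an assumption on my part: urllib.request replaces urllib2, and bytes.translate still accepts None as the translation table):

    from urllib.request import urlopen
    from collections import Counter
    from string import punctuation
    
    def freq_dist_py3(data):
        """data: an iterable of byte strings (e.g. bytes.splitlines())."""
        punc = punctuation.encode('utf-8')
        words = (word for line in data
                      for word in line.translate(None, punc).decode('utf-8').split())
        return Counter(words)
    
    # word_dist = freq_dist_py3(urlopen(url).read().splitlines())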
    
  • 2020-11-29 05:06

    A memory-efficient and accurate way is to make use of:

    • CountVectorizer in scikit (for ngram extraction)
    • NLTK for word_tokenize
    • numpy matrix sum to collect the counts
    • collections.Counter for collecting the counts and vocabulary

    An example:

    import urllib.request
    from collections import Counter
    
    import numpy as np 
    
    from nltk import word_tokenize
    from sklearn.feature_extraction.text import CountVectorizer
    
    # Our sample textfile.
    url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
    response = urllib.request.urlopen(url)
    data = response.read().decode('utf8')
    
    
    # Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    # X matrix where the row represents sentences and column is our one-hot vector for each token in our vocabulary
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    
    # Vocabulary
    # (newer scikit-learn versions replace get_feature_names() with get_feature_names_out())
    vocab = list(ngram_vectorizer.get_feature_names())
    
    # Column-wise sum of the X matrix.
    # It's some crazy numpy syntax that looks horribly unpythonic
    # For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
    # and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
    counts = X.sum(axis=0).A1
    
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print (freq_distribution.most_common(10))
    

    [out]:

    [(',', 32000),
     ('.', 17783),
     ('de', 11225),
     ('a', 7197),
     ('que', 5710),
     ('la', 4732),
     ('je', 4304),
     ('se', 4013),
     ('на', 3978),
     ('na', 3834)]
    

    Essentially, you can also do this:

    from collections import Counter
    import numpy as np 
    from nltk import word_tokenize
    from sklearn.feature_extraction.text import CountVectorizer
    
    def freq_dist(data):
        """
        :param data: A string with sentences separated by '\n'
        :type data: str
        """
        ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
        X = ngram_vectorizer.fit_transform(data.split('\n'))
        vocab = list(ngram_vectorizer.get_feature_names())
        counts = X.sum(axis=0).A1
        return Counter(dict(zip(vocab, counts)))
    

    Let's time it:

    import time
    
    start = time.time()
    word_distribution = freq_dist(data)
    print (time.time() - start)
    

    [out]:

    5.257147789001465
    

    Note that CountVectorizer can also take a file object instead of a string (it treats each line as a separate document), so there's no need to read the whole file into memory. In code:

    import io
    from collections import Counter
    
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    
    infile = '/path/to/input.txt'
    
    ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
    
    with io.open(infile, 'r', encoding='utf8') as fin:
        X = ngram_vectorizer.fit_transform(fin)
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
        freq_distribution = Counter(dict(zip(vocab, counts)))
        print (freq_distribution.most_common(10))
    
  • 2020-11-29 05:06

    This should suffice.

    def countinfile(filename):
        d = {}
        with open(filename, "r") as fin:
            for line in fin:
                words = line.strip().split()
                for word in words:
                    try:
                        d[word] += 1
                    except KeyError:
                        d[word] = 1
        return d
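
    A slightly shorter equivalent, as a sketch using collections.defaultdict (which removes the need for the try/except):

    from collections import defaultdict
    
    def countinfile(filename):
        # defaultdict(int) yields 0 for missing keys, so no KeyError handling is needed
        d = defaultdict(int)
        with open(filename, "r") as fin:
            for line in fin:
                for word in line.strip().split():
                    d[word] += 1
        return dict(d)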
    