Efficiently count word frequencies in Python

走了就别回头了 2020-11-29 04:33

I'd like to count the frequencies of all words in a text file.

>>> countInFile('test.txt')

should return {'aaa': 1, 'bbb': ...}
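
For reference, a minimal sketch of such a countInFile using only the standard library's collections.Counter (whitespace tokenization is an assumption; the question does not specify how words are delimited):

```python
from collections import Counter

def countInFile(file_path):
    # Count whitespace-separated tokens across all lines of the file.
    counts = Counter()
    with open(file_path, encoding='utf8') as fin:
        for line in fin:
            counts.update(line.split())
    return dict(counts)
```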

8 Answers
  • 2020-11-29 05:12

    You can try with sklearn:

    from sklearn.feature_extraction.text import CountVectorizer
    import numpy as np

    vectorizer = CountVectorizer()

    data = ['i am student', 'the student suffers a lot']
    transformed_data = vectorizer.fit_transform(data)
    vocab = {word: count for word, count in zip(vectorizer.get_feature_names(), np.ravel(transformed_data.sum(axis=0)))}
    print(vocab)
    
  • 2020-11-29 05:13

    Here's a benchmark. It may look strange, but the crudest code wins.

    [code]:

    from collections import Counter, defaultdict
    import io, time
    
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    
    infile = '/path/to/file'
    
    def extract_dictionary_sklearn(file_path):
        with io.open(file_path, 'r', encoding='utf8') as fin:
            ngram_vectorizer = CountVectorizer(analyzer='word')
            X = ngram_vectorizer.fit_transform(fin)
            vocab = ngram_vectorizer.get_feature_names()
            counts = X.sum(axis=0).A1
        return Counter(dict(zip(vocab, counts)))
    
    def extract_dictionary_native(file_path):
        dictionary = Counter()
        with io.open(file_path, 'r', encoding='utf8') as fin:
            for line in fin:
                dictionary.update(line.split())
        return dictionary
    
    def extract_dictionary_paddle(file_path):
        dictionary = defaultdict(int)
        with io.open(file_path, 'r', encoding='utf8') as fin:
            for line in fin:
                for word in line.split():
                    dictionary[word] += 1
        return dictionary
    
    start = time.time()
    extract_dictionary_sklearn(infile)
    print(time.time() - start)
    
    start = time.time()
    extract_dictionary_native(infile)
    print(time.time() - start)
    
    start = time.time()
    extract_dictionary_paddle(infile)
    print(time.time() - start)
    

    [out]:

    38.306814909
    24.8241138458
    12.1182529926
    

    Data size (154MB) used in the benchmark above:

    $ wc -c /path/to/file
    161680851
    
    $ wc -l /path/to/file
    2176141
    

    Some things to note:

    • With the sklearn version, there is the overhead of vectorizer creation, plus numpy manipulation and conversion into a Counter object
    • With the native Counter version, Counter.update() seems to be an expensive operation
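
    To illustrate the last point, a small sketch (my own, not from the answer) comparing the Counter.update style with the plain-dict-increment style on the same input; both produce identical frequencies, and the defaultdict loop avoids Counter's per-update overhead:

    ```python
    from collections import Counter, defaultdict

    lines = ['the quick brown fox', 'the lazy dog', 'the fox']

    # Counter style: one update() call per line.
    counter_counts = Counter()
    for line in lines:
        counter_counts.update(line.split())

    # defaultdict style: one plain dict increment per word.
    dd_counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            dd_counts[word] += 1

    # Both approaches agree on the final frequencies.
    assert dict(counter_counts) == dict(dd_counts)
    print(dict(dd_counts))
    # → {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
    ```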