Efficiently count word frequencies in Python

走了就别回头了 2020-11-29 04:33

I'd like to count the frequencies of all words in a text file.

>>> countInFile('test.txt')

should return {'aaa': 1, 'bbb': ...}
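
For reference, a minimal sketch of such a countInFile using only the standard library's collections.Counter (whitespace tokenization is an assumption; the question does not specify how words are delimited):

```python
from collections import Counter

def countInFile(file_path):
    # Count whitespace-separated tokens across all lines of the file.
    counts = Counter()
    with open(file_path, encoding='utf8') as fin:
        for line in fin:
            counts.update(line.split())
    return dict(counts)
```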

8 Answers
  • 2020-11-29 05:12

    You can try with sklearn:

    from sklearn.feature_extraction.text import CountVectorizer
    import numpy as np

    vectorizer = CountVectorizer()

    data = ['i am student', 'the student suffers a lot']
    transformed_data = vectorizer.fit_transform(data)
    vocab = {word: count for word, count in zip(vectorizer.get_feature_names(), np.ravel(transformed_data.sum(axis=0)))}
    print(vocab)
    
  • 2020-11-29 05:13

    Here's a benchmark. It may look strange, but the crudest code wins.

    [code]:

    from collections import Counter, defaultdict
    import io, time
    
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    
    infile = '/path/to/file'
    
    def extract_dictionary_sklearn(file_path):
        with io.open(file_path, 'r', encoding='utf8') as fin:
            ngram_vectorizer = CountVectorizer(analyzer='word')
            X = ngram_vectorizer.fit_transform(fin)
            vocab = ngram_vectorizer.get_feature_names()
            counts = X.sum(axis=0).A1
        return Counter(dict(zip(vocab, counts)))
    
    def extract_dictionary_native(file_path):
        dictionary = Counter()
        with io.open(file_path, 'r', encoding='utf8') as fin:
            for line in fin:
                dictionary.update(line.split())
        return dictionary
    
    def extract_dictionary_paddle(file_path):
        dictionary = defaultdict(int)
        with io.open(file_path, 'r', encoding='utf8') as fin:
            for line in fin:
                for word in line.split():
                    dictionary[word] += 1
        return dictionary
    
    start = time.time()
    extract_dictionary_sklearn(infile)
    print(time.time() - start)
    
    start = time.time()
    extract_dictionary_native(infile)
    print(time.time() - start)
    
    start = time.time()
    extract_dictionary_paddle(infile)
    print(time.time() - start)
    

    [out]:

    38.306814909
    24.8241138458
    12.1182529926
    

    Data size (154MB) used in the benchmark above:

    $ wc -c /path/to/file
    161680851
    
    $ wc -l /path/to/file
    2176141
    

    Some things to note:

    • With the sklearn version, there is the overhead of vectorizer creation, plus numpy manipulation and conversion into a Counter object
    • With the native Counter version, Counter.update() seems to be an expensive operation
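
    To illustrate the last point, a small sketch (my own, not from the answer) comparing the Counter.update style with the plain-dict-increment style on the same input; both produce identical frequencies, and the defaultdict loop avoids Counter's per-update overhead:

    ```python
    from collections import Counter, defaultdict

    lines = ['the quick brown fox', 'the lazy dog', 'the fox']

    # Counter style: one update() call per line.
    counter_counts = Counter()
    for line in lines:
        counter_counts.update(line.split())

    # defaultdict style: one plain dict increment per word.
    dd_counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            dd_counts[word] += 1

    # Both approaches agree on the final frequencies.
    assert dict(counter_counts) == dict(dd_counts)
    print(dict(dd_counts))
    # → {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
    ```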