Use sklearn TfidfVectorizer with already tokenized inputs?

Backend · unresolved · 3 answers · 552 views
闹比i · 2021-02-05 14:29

I have a list of tokenized sentences and would like to fit a TfidfVectorizer. I tried the following:

tokenized_list_of_sentences = [['this', 'is', 'one'],
                               ['this', 'is', 'another']]
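
Fitting a TfidfVectorizer on this directly fails, because with the default settings the preprocessor calls .lower() on each document and therefore expects a string, not a list of tokens:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> TfidfVectorizer().fit_transform(tokenized_list_of_sentences)
Traceback (most recent call last):
  ...
AttributeError: 'list' object has no attribute 'lower'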
3 Answers
  • 2021-02-05 14:46

    As @Jarad said, just use a "passthrough" function for your analyzer, but it needs to ignore stop words. You can get stop words from sklearn:

    >>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    

    or from nltk:

    >>> import nltk
    >>> nltk.download('stopwords')
    >>> from nltk.corpus import stopwords
    >>> stop_words = set(stopwords.words('english'))
    

    or combine both sets:

    stop_words = stop_words.union(ENGLISH_STOP_WORDS)
    

    But then your examples contain only stop words (every word in them is in sklearn's ENGLISH_STOP_WORDS set).
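
    A quick check confirms this, using the set imported above:

    >>> all(w in ENGLISH_STOP_WORDS for w in ['this', 'is', 'one', 'another'])
    True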

    Nonetheless, @Jarad's examples work:

    >>> tokenized_list_of_sentences =  [
    ...     ['this', 'is', 'one', 'cat', 'or', 'dog'],
    ...     ['this', 'is', 'another', 'dog']]
    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> tfidf = TfidfVectorizer(analyzer=lambda x:[w for w in x if w not in stop_words])
    >>> tfidf_vectors = tfidf.fit_transform(tokenized_list_of_sentences)
    

    I like pd.DataFrames for browsing TF-IDF vectors:

    >>> import pandas as pd
    >>> pd.DataFrame(tfidf_vectors.todense(), columns=tfidf.get_feature_names())
            cat       dog 
    0  0.814802  0.579739
    1  0.000000  1.000000
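
    Note that when analyzer is a callable, scikit-learn skips its own preprocessing, tokenization, and stop-word handling entirely and just applies your function to each document; that is why the lambda has to drop the stop words itself, and why options such as lowercase have no effect here. You can inspect what the vectorizer sees (assuming the tfidf object from above):

    >>> tfidf.build_analyzer()(['this', 'is', 'one', 'cat', 'or', 'dog'])
    ['cat', 'dog']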
    
  • 2021-02-05 15:03

    Try preprocessor instead of tokenizer.

        return lambda x: strip_accents(x.lower())
    AttributeError: 'list' object has no attribute 'lower'
    

    If x in the above error message is a list, then calling x.lower() on it will throw the error, because lists have no lower method.

    Your two examples are all stop words, so to make this example return something, throw in a few random words. Here's an example:

    tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                           ['this', 'is', 'another', 'dog']]
    
    tfidf = TfidfVectorizer(preprocessor=' '.join, stop_words='english')
    tfidf.fit_transform(tokenized_sentences)
    

    Returns:

    <2x2 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in Compressed Sparse Row format>
    

    Features:

    >>> tfidf.get_feature_names()
    ['cat', 'dog']
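
    One caveat with the ' '.join round trip: after joining, the text is re-split with the default token_pattern, which only keeps tokens of two or more word characters, so single-character tokens (and punctuation-only tokens) will not survive. A small sketch of the effect, using a throwaway vectorizer:

    >>> tfidf_join = TfidfVectorizer(preprocessor=' '.join)
    >>> tfidf_join.build_analyzer()(['a', 'b', 'cat'])
    ['cat']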
    

    UPDATE: maybe use lambdas for tokenizer and preprocessor?

    tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                           ['this', 'is', 'another', 'dog']]
    
    tfidf = TfidfVectorizer(tokenizer=lambda x: x,
                            preprocessor=lambda x: x, stop_words='english')
    tfidf.fit_transform(tokenized_sentences)
    
    <2x2 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in Compressed Sparse Row format>
    >>> tfidf.get_feature_names()
    ['cat', 'dog']
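
    One more consideration: lambdas cannot be pickled, so if you need to persist the fitted vectorizer (e.g. with joblib), prefer a named module-level function such as the identity_tokenizer in the next answer. A minimal sketch, assuming you want to save to a file (the name tfidf.joblib is just an example):

    import joblib

    def identity(tokens):
        # passthrough: the input is already a list of tokens
        return tokens

    tfidf = TfidfVectorizer(tokenizer=identity, preprocessor=identity,
                            stop_words='english')
    tfidf.fit_transform(tokenized_sentences)
    joblib.dump(tfidf, 'tfidf.joblib')  # a lambda here would raise PicklingError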
    
  • 2021-02-05 15:08

    Try initializing the TfidfVectorizer object with the parameter lowercase=False (assuming this is actually desired, as you have already lowercased your tokens in previous stages).

    tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]
    
    def identity_tokenizer(text):
        return text
    
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)    
    tfidf.fit_transform(tokenized_list_of_sentences)
    

    Note that I changed the sentences, as they apparently contained only stop words, which caused another error due to an empty vocabulary.
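
    For reference, with these inputs the remaining features should be just the two sport terms, since 'this', 'is', 'one', and 'a' are all English stop words (on scikit-learn >= 1.0, use get_feature_names_out() instead of the deprecated get_feature_names()):

    >>> tfidf.get_feature_names()
    ['basketball', 'football']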
