ColumnTransformer with TfidfVectorizer produces “empty vocabulary” error

后端未结

关注

 2  1662

I am running a very simple experiment with ColumnTransformer with an intent to transform an array of columns, [\"a\"] in this example:

from skle


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  粉色の甜心        
                
              
                            
                2020-12-06 08:26
              
            
            
                                                                       
That's because you are providing ["a"] instead of "a" in ColumnTransformer. According to the documentation:


  A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.


Now, TfidfVectorizer requires a single iterator of strings for input (so a 1-d array of strings). But since you are sending a list of column names in ColumnTransformer (even though that list only contains a single column), it will be 2-d array that will be passed to TfidfVectorizer. And hence the error.

Change that to:

clmn = ColumnTransformer([("tfidf", tfidf, "a")],
                         remainder="passthrough")


For more understanding, try using the above things to select data from a pandas DataFrame. Check the format (dtype, shape) of the returned data when you do:

dataset['a']

vs 

dataset[['a']]


Update: @SergeyBushmanov, Regarding your comment on the other answer, I think that you are misinterpreting the documentation. If you want to do tfidf on two columns, then you need to pass two transformers. Something like this:

tfidf_1 = TfidfVectorizer(min_df=0)
tfidf_2 = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf_1", tfidf_1, "a"), 
                          ("tfidf_2", tfidf_2, "b")
                         ],
                         remainder="passthrough")

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  旧巷少年郎        
                
              
                            
                2020-12-06 08:34
              
            
            
                                                                       
we can create a custom tfidf transformer, which can take a array of columns and then join them before applying .fit() or .transform().

Try this!

from sklearn.base import BaseEstimator,TransformerMixin

class custom_tfidf(BaseEstimator,TransformerMixin):
    def __init__(self,tfidf):
        self.tfidf = tfidf

    def fit(self, X, y=None):
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)
        self.tfidf.fit(joined_X)        
        return self

    def transform(self, X):
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)

        return self.tfidf.transform(joined_X)        

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
dataset = pd.DataFrame({"a":["word gone wild","word gone with wind"],
                        "b":[" gone fhgf wild","gone with wind"],
                        "c":[1,2]})
tfidf = TfidfVectorizer(min_df=0)

clmn = ColumnTransformer([("tfidf", custom_tfidf(tfidf), ['a','b'])],remainder="passthrough")
clmn.fit_transform(dataset)

#
array([[0.36439074, 0.51853403, 0.72878149, 0.        , 0.        ,
        0.25926702, 1.        ],
       [0.        , 0.438501  , 0.        , 0.61629785, 0.61629785,
        0.2192505 , 2.        ]])


P.S. : May be you might want to create a tfidf vectorizer for each column, then create a dictionary with key as column name and value as fitted vectorizer. This dictionary can be used during transform of corresponding columns
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复