Sklearn Pipeline: Get feature names after OneHotEncode In ColumnTransformer

梦毁少年i 2021-01-30 07:26

I want to get feature names after I fit the pipeline.

categorical_features = ['brand', 'category_name', 'sub_category']
categorical_transformer = Pipeline(s         


        
2 Answers
  • 2021-01-30 07:34

    EDIT: the answer to Peter's comment is actually in the ColumnTransformer documentation:

    The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
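
    The ordering rule quoted above can be checked on a tiny array. This is only a sketch: the transformer names `uses_col2` and `uses_col0` are made up for illustration, and `FunctionTransformer()` with no function is used as an identity step so the column values stay recognizable.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Three columns whose values identify them: col0 = 0, col1 = 10, col2 = 20.
X = np.array([[0, 10, 20],
              [0, 10, 20]])

# Column 2 is listed first, column 0 second; column 1 is not listed at all.
transformers = [
    ("uses_col2", FunctionTransformer(), [2]),  # identity transform
    ("uses_col0", FunctionTransformer(), [0]),  # identity transform
]

# With remainder="passthrough", the unlisted column is appended on the right.
kept = ColumnTransformer(transformers, remainder="passthrough").fit_transform(X)

# With the default remainder="drop", the unlisted column disappears.
dropped = ColumnTransformer(transformers).fit_transform(X)

print(kept[0])        # column 2, then column 0, then the passed-through column 1
print(dropped.shape)  # only the two listed columns remain
```

    The output follows the order of the transformers list, not the order of the columns in the input.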


    To complete Venkatachalam's answer with what Paul asked in his comment: the order of the feature names returned by the ColumnTransformer .get_feature_names() method follows the order in which the transformers are declared in the list passed to ColumnTransformer at instantiation.

    I could not find any documentation at first, so I played with the toy example below, which let me understand the logic.

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import RobustScaler
    
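    class testEstimator(BaseEstimator,TransformerMixin):
        """Toy transformer: replaces its input column with a constant string."""
        def __init__(self,string):
            self.string = string
    
        def fit(self, X, y=None):
            # Accept y so the transformer also works in supervised pipelines.
            return self
    
        def transform(self,X):
            # X is 1-D here because each column is selected by a scalar index.
            return np.full(X.shape, self.string).reshape(-1,1)
    
        def get_feature_names(self):
            # Return a list: ColumnTransformer iterates over this value.
            # (scikit-learn 1.0+ uses get_feature_names_out instead.)
            return [self.string]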
    
    transformers = [('first_transformer',testEstimator('A'),1), ('second_transformer',testEstimator('B'),0)]
    column_transformer = ColumnTransformer(transformers)
    steps = [('scaler',RobustScaler()), ('transformer', column_transformer)]
    pipeline = Pipeline(steps)
    
    dt_test = np.zeros((1000,2))
    pipeline.fit_transform(dt_test)
    
    for name,step in pipeline.named_steps.items():
        if hasattr(step, 'get_feature_names'):
            print(step.get_feature_names())
    

    To make the example more representative, I added a RobustScaler and nested the ColumnTransformer in a Pipeline. The loop above is my version of Venkatachalam's way to get the feature names, by looping over the steps. You can collect the names into a single flat list with a list comprehension:

    [i for k, v in pipeline.named_steps.items() if hasattr(v,'get_feature_names') for i in v.get_feature_names()]
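
    In a list comprehension the `for` clauses run left to right, like nested loops, so the clause that binds a step has to come before the clause that reads that step's feature names. A minimal sketch with stand-in objects (the `FakeStep` class and the step names are hypothetical, not scikit-learn objects):

```python
class FakeStep:
    """Hypothetical stand-in for a fitted transformer (not scikit-learn)."""

    def __init__(self, names):
        self._names = names

    def get_feature_names(self):
        return self._names


named_steps = {
    "scaler": object(),  # has no get_feature_names method
    "transformer": FakeStep(["first__A", "second__B"]),
    "extra": FakeStep(["extra__C"]),
}

flat = [
    name
    for step in named_steps.values()        # outer loop: binds `step`
    if hasattr(step, "get_feature_names")   # skip steps without the method
    for name in step.get_feature_names()    # inner loop: uses `step`
]
print(flat)
```

    Putting the inner `for` first would reference `step` before it is bound and raise a `NameError`.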
    

    So play around with dt_test and the estimators to see how each feature name is built and how the names are concatenated by get_feature_names(). Here is another example with a transformer that outputs 2 columns, one of which is the input column:

    class testEstimator3(BaseEstimator,TransformerMixin):
        """Toy transformer: outputs the input column plus a constant column."""
        def __init__(self,string):
            self.string = string
    
        def fit(self, X, y=None):
            # Remember the first unique input value to use as a feature name.
            self.unique = np.unique(X)[0]
            return self
    
        def transform(self,X):
            return np.concatenate((X.reshape(-1,1), np.full(X.shape,self.string).reshape(-1,1)), axis = 1)
    
        def get_feature_names(self):
            # One name per output column.
            return [self.unique, self.string]
    
    dt_test2 = np.concatenate((np.full((1000,1),'A'),np.full((1000,1),'B')), axis = 1)
    
    transformers = [('first_transformer',testEstimator3('A'),1), ('second_transformer',testEstimator3('B'),0)]
    column_transformer = ColumnTransformer(transformers)
    steps = [('transformer', column_transformer)]
    pipeline = Pipeline(steps)
    
    pipeline.fit_transform(dt_test2)
    for step in pipeline.steps:
        if hasattr(step[1], 'get_feature_names'):
            print(step[1].get_feature_names())
    
  • 2021-01-30 07:54

    You can access the feature names using the following snippet:

    clf.named_steps['preprocessor'].transformers_[1][1]\
       .named_steps['onehot'].get_feature_names(categorical_features)
    

    With sklearn >= 0.21, Pipeline steps can be indexed directly, which makes this simpler:

    clf['preprocessor'].transformers_[1][1]['onehot']\
                       .get_feature_names(categorical_features)
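
    The two access styles above reach the same fitted object. As a small sketch on made-up numeric data, this checks that, and also shows `named_transformers_`, which looks a transformer up by name instead of by its position in `transformers_`:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: column 0 is numeric, column 1 holds two category codes.
X = np.array([[1.0, 0], [2.0, 1], [3.0, 0]])

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), [0]),
    ("cat", Pipeline([("onehot", OneHotEncoder())]), [1]),
])
clf = Pipeline([("preprocessor", preprocessor)])
clf.fit(X)

# Positional lookup through named_steps and transformers_.
by_position = (clf.named_steps["preprocessor"]
               .transformers_[1][1]
               .named_steps["onehot"])

# Name-based lookup with Pipeline indexing (sklearn >= 0.21).
by_name = clf["preprocessor"].named_transformers_["cat"]["onehot"]

print(by_position is by_name)
```

    The name-based form keeps working if the order of the transformers list changes, whereas `transformers_[1][1]` depends on the position.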
    

    Reproducible example:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression
    
    df = pd.DataFrame({'brand': ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
                       'category': ['asdf', 'asfa', 'asdfas', 'as'],
                       'num1': [1, 1, 0, 0],
                       'target': [0.2, 0.11, 1.34, 1.123]})
    
    numeric_features = ['num1']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])
    
    categorical_features = ['brand', 'category']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])
    
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('regressor',  LinearRegression())])
    clf.fit(df.drop('target', axis=1), df['target'])
    
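    # Note: get_feature_names was removed in scikit-learn 1.2;
    # on 1.0 and later, call get_feature_names_out(categorical_features) instead.
    clf.named_steps['preprocessor'].transformers_[1][1]\
       .named_steps['onehot'].get_feature_names(categorical_features)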
    
    # ['brand_NaN' 'brand_aaaa' 'brand_asdfasdf' 'brand_sadfds' 'category_as'
    #  'category_asdf' 'category_asdfas' 'category_asfa']
    