Can You Consistently Keep Track of Column Labels Using Sklearn's Transformer API?

忘了有多久 2020-11-30 11:21

This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems that, for the most part, the answer is "No."

1 Answer
  • 2020-11-30 11:51

    Yes, you are right: as of now, sklearn does not fully support tracking feature names. The initial design decision was to keep estimators generic by working at the level of NumPy arrays. The latest progress on adding feature names to sklearn estimators can be tracked here.

    In the meantime, we can write wrapper functions to get the feature names out of a ColumnTransformer. I am not sure whether this covers every possible kind of ColumnTransformer, but it should at least solve your problem.

    From the documentation of ColumnTransformer:

    Notes

    The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
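
    The ordering rule quoted above can be checked on a toy array. This is only a minimal sketch: the array values and the transformer names 'second' and 'first' are made up for illustration, and FunctionTransformer() with no function is used as an identity transformer so the columns stay recognisable.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Three input columns with easily recognisable values (hypothetical data).
X = np.array([[1., 10., 100.],
              [2., 20., 200.]])

# Column 1 is listed before column 0; column 2 is not listed at all,
# so it is handled by remainder='passthrough'.
ct = ColumnTransformer(
    [('second', FunctionTransformer(), [1]),
     ('first', FunctionTransformer(), [0])],
    remainder='passthrough')

out = ct.fit_transform(X)
print(out)
# Output columns follow the transformers list (column 1, then column 0),
# and the passthrough column 2 is appended on the right.
```

    This is why a name-tracking helper has to walk the transformers in their listed order and add the passthrough columns last.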

    Try this!

    import pandas as pd
    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import make_pipeline, Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
    from sklearn.feature_extraction.text import _VectorizerMixin
    from sklearn.feature_selection._base import SelectorMixin
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_extraction.text import CountVectorizer
    
    train = pd.DataFrame({'age': [23,12, 12, np.nan],
                          'Gender': ['M','F', np.nan, 'F'],
                          'income': ['high','low','low','medium'],
                          'sales': [10000, 100020, 110000, 100],
                          'foo' : [1,0,0,1],
                          'text': ['I will test this',
                                   'need to write more sentence',
                                   'want to keep it simple',
                                   'hope you got that these sentences are junk'],
                          'y': [0,1,1,1]})
    numeric_columns = ['age']
    cat_columns     = ['Gender','income']
    
    numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
    cat_pipeline     = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
    text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))
    
    transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns),
    ('text', text_pipeline, 'text'),
    ('simple_transformer', MinMaxScaler(), ['sales']),
    ]
    
    combined_pipe = ColumnTransformer(transformers, remainder='passthrough')
    
    # drop the target by keyword; the positional axis argument was removed in pandas 2.0
    transformed_data = combined_pipe.fit_transform(train.drop(columns='y'), train['y'])
    
    
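    def get_feature_out(estimator, feature_in):
        # scikit-learn >= 1.0 provides get_feature_names_out; the older
        # get_feature_names was deprecated in 1.0 and removed in 1.2,
        # so it is only used as a fallback on old releases.
        if hasattr(estimator, 'get_feature_names_out'):
            get_names = estimator.get_feature_names_out
        else:
            get_names = getattr(estimator, 'get_feature_names', None)

        if get_names is not None:
            if isinstance(estimator, _VectorizerMixin):
                # handling all vectorizers: names come from the vocabulary,
                # not from the input column
                return [f'vec_{f}' for f in get_names()]
            else:
                return get_names(feature_in)
        elif isinstance(estimator, SelectorMixin):
            # selectors keep a subset of the incoming features
            return np.array(feature_in)[estimator.get_support()]
        else:
            # assume the estimator maps each input column to one output column
            return feature_in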
    
    
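    def get_ct_feature_names(ct):
        # handles plain estimators and pipelines inside a ColumnTransformer
        # remainder='passthrough' needs the input column names, so it only
        # works when the ColumnTransformer was fitted on a DataFrame.
        output_features = []

        # public attribute in scikit-learn >= 1.0, private one in older releases
        names_in = getattr(ct, 'feature_names_in_',
                           getattr(ct, '_feature_names_in', None))

        for name, estimator, features in ct.transformers_:
            if name != 'remainder':
                if isinstance(estimator, Pipeline):
                    current_features = features
                    for step in estimator:
                        current_features = get_feature_out(step, current_features)
                    features_out = current_features
                else:
                    features_out = get_feature_out(estimator, features)
                output_features.extend(features_out)
            elif estimator != 'drop':
                # passed-through columns keep their input names; depending on
                # the release they are stored as positions or as names
                output_features.extend(
                    f if isinstance(f, str) else names_in[f] for f in features)

        return output_features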
    
    
    
    
    pd.DataFrame(transformed_data, 
                 columns=get_ct_feature_names(combined_pipe))
    
