Getting feature names from within a FeatureUnion + Pipeline

抹茶落季 2020-12-30 02:02

I am using a FeatureUnion to join features found from the title and description of events:

union = FeatureUnion(
    transformer_list=[
    # Pipeline for pu         


        
2 Answers
  • 2020-12-30 02:20

    It's because you are using a custom transformer called TextSelector. Did you implement get_feature_names in TextSelector?

    You will have to implement this method in your custom transformer if you want this to work.

    Here is a concrete example for you:

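    from sklearn.datasets import load_boston
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.base import BaseEstimator, TransformerMixin
    import pandas as pd
    
    # load_boston and get_feature_names were removed in scikit-learn 1.2,
    # so this runs as written only on older releases
    dat = load_boston()
    X = pd.DataFrame(dat['data'], columns=dat['feature_names'])
    y = dat['target']
    
    # define first custom transformer
    class first_transform(TransformerMixin, BaseEstimator):
        def fit(self, df, y=None):
            # remember the column names so get_feature_names can report them
            self.columns_ = df.columns.tolist()
            return self
    
        def transform(self, df):
            return df
    
        def get_feature_names(self):
            return self.columns_
    
    
    # define second custom transformer (same behaviour as the first)
    class second_transform(TransformerMixin, BaseEstimator):
        def fit(self, df, y=None):
            self.columns_ = df.columns.tolist()
            return self
    
        def transform(self, df):
            return df
    
        def get_feature_names(self):
            return self.columns_
    
    
    pipe = Pipeline([
               ('features', FeatureUnion([
                            ('custom_transform_first', first_transform()),
                            ('custom_transform_second', second_transform())
                        ])
                )])
    
    # the transformers only know their column names after fitting
    pipe.fit(X, y)
    
    >>> pipe.named_steps['features'].get_feature_names()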
    ['custom_transform_first__CRIM',
     'custom_transform_first__ZN',
     'custom_transform_first__INDUS',
     'custom_transform_first__CHAS',
     'custom_transform_first__NOX',
     'custom_transform_first__RM',
     'custom_transform_first__AGE',
     'custom_transform_first__DIS',
     'custom_transform_first__RAD',
     'custom_transform_first__TAX',
     'custom_transform_first__PTRATIO',
     'custom_transform_first__B',
     'custom_transform_first__LSTAT',
     'custom_transform_second__CRIM',
     'custom_transform_second__ZN',
     'custom_transform_second__INDUS',
     'custom_transform_second__CHAS',
     'custom_transform_second__NOX',
     'custom_transform_second__RM',
     'custom_transform_second__AGE',
     'custom_transform_second__DIS',
     'custom_transform_second__RAD',
     'custom_transform_second__TAX',
     'custom_transform_second__PTRATIO',
     'custom_transform_second__B',
     'custom_transform_second__LSTAT']
    

    Keep in mind that FeatureUnion concatenates the lists returned by each transformer's get_feature_names, prefixing every name with the transformer's name. This is why you get an error when one or more of your transformers do not have this method.
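To make the concatenation concrete, here is a plain-Python model of that behaviour. It is a simplified sketch, not scikit-learn's actual source; `union_feature_names`, `HasNames` and `NoNames` are hypothetical names made up for this illustration.

```python
class HasNames:
    """Hypothetical transformer that reports its feature names."""

    def get_feature_names(self):
        return ["CRIM", "ZN"]


class NoNames:
    """Hypothetical transformer that lacks get_feature_names."""


def union_feature_names(transformer_list):
    # Simplified model: prefix each feature with "<transformer name>__"
    # and concatenate the per-transformer lists in order.
    names = []
    for name, trans in transformer_list:
        if not hasattr(trans, "get_feature_names"):
            raise AttributeError(
                f"Transformer {name} does not provide get_feature_names."
            )
        names.extend(f"{name}__{f}" for f in trans.get_feature_names())
    return names


combined = union_feature_names([("first", HasNames()), ("second", HasNames())])
print(combined)

# A single transformer without the method is enough to make the call fail.
try:
    union_feature_names([("first", HasNames()), ("broken", NoNames())])
    failed = False
except AttributeError:
    failed = True
print(failed)
```

The second call fails for the same reason the union in the question does: one member cannot report its names.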

    However, I can see that this alone will not fix your problem, as Pipeline objects don't have a get_feature_names method, and you have nested pipelines (pipelines within FeatureUnions). So you have two options:

    1. Subclass Pipeline and add a get_feature_names method to it yourself, one that returns the feature names from the last transformer in the chain.

    2. Extract the feature names yourself: grab each transformer out of the pipeline and call get_feature_names on it.
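A minimal sketch of option 1 might look like the following. `NamedPipeline` is a hypothetical class name, and the fallback between get_feature_names_out and get_feature_names is an assumption added so the sketch works on both newer and older scikit-learn releases.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class NamedPipeline(Pipeline):
    """Hypothetical Pipeline subclass that exposes the last step's feature names."""

    def get_feature_names(self):
        last_step = self.steps[-1][1]
        # Newer scikit-learn releases name this method get_feature_names_out;
        # fall back to get_feature_names on older ones.
        if hasattr(last_step, "get_feature_names_out"):
            return list(last_step.get_feature_names_out())
        return list(last_step.get_feature_names())


docs = ["red apple", "green apple"]
pipe = NamedPipeline([("count", CountVectorizer())])
pipe.fit(docs)

names = pipe.get_feature_names()
print(names)
```

Because the subclass answers get_feature_names itself, it can sit inside a FeatureUnion on releases where the union calls that method on each member.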

    Also, keep in mind that many of sklearn's built-in transformers don't operate on DataFrames but pass numpy arrays around, so watch out for that if you are going to chain lots of transformers together. But I think this is enough information to give you an idea of what is happening.
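As a small illustration of the DataFrame point, the sketch below passes a made-up DataFrame through StandardScaler with default settings. The column names do not survive the transform, so anything downstream that needs them has to keep its own copy.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data, for illustration only
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# With default settings the scaler returns a plain numpy array
scaled = StandardScaler().fit_transform(df)
print(type(scaled))

# Keep the column names yourself if a later step needs them
columns = df.columns.tolist()
restored = pd.DataFrame(scaled, columns=columns, index=df.index)
print(restored.columns.tolist())
```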

    One more thing, have a look at sklearn-pandas. I haven't used it myself but it might provide a solution for you.

  • 2020-12-30 02:37

    You can reach each vectorizer nested inside the union like this (thanks edesz):

    pipevect= dict(pipeline.named_steps['union'].transformer_list).get('title').named_steps['count']
    

    That gives you the TfidfVectorizer() instance, which you can pass to another function:

    Show_most_informative_features(pipevect,
           pipeline.named_steps['classifier'], n=MostIF)
    # or call it directly
    print(pipevect.get_feature_names())
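For a version of that lookup that runs end to end, here is a sketch under stated assumptions. The step names 'union', 'title' and 'count' come from the snippet above; the `TextSelector` class, the `feature_names_of` helper, the event data and the choice of CountVectorizer for the 'count' step are hypothetical stand-ins, and the classifier step is left out.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class TextSelector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the question's TextSelector: returns one column."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


def feature_names_of(vectorizer):
    # Newer scikit-learn releases offer get_feature_names_out; older ones
    # only have get_feature_names, so try the newer name first.
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


# Made-up event data, for illustration only
events = pd.DataFrame({
    "title": ["jazz concert", "rock concert"],
    "description": ["live music downtown", "loud music outdoors"],
})

pipeline = Pipeline([
    ("union", FeatureUnion([
        ("title", Pipeline([
            ("selector", TextSelector("title")),
            ("count", CountVectorizer()),
        ])),
        ("description", Pipeline([
            ("selector", TextSelector("description")),
            ("tfidf", TfidfVectorizer()),
        ])),
    ])),
])
pipeline.fit(events)

# Same lookup as above: union -> 'title' sub-pipeline -> 'count' step
pipevect = dict(pipeline.named_steps["union"].transformer_list).get("title").named_steps["count"]
title_names = feature_names_of(pipevect)
print(title_names)
```

The `dict(...)` call works because transformer_list is a list of (name, transformer) pairs, so each sub-pipeline can be looked up by its name.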
    