How to extract feature importances from an Sklearn pipeline

情书的邮戳 2020-12-31 10:32

I've built a pipeline in Scikit-Learn with two steps: one to construct features, and the second is a RandomForestClassifier.

While I can save that pipeline, look at the various steps and the parameters set in them, I'd like to be able to examine the feature importances of the resulting model. Is that possible?

2 Answers
  • 2020-12-31 11:09

    Ah, yes it is.

    You just identify the step where you want to check the estimator.

    For instance:

    pipeline.steps[1]
    

    Which returns:

    ('predictor',
     RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                 max_depth=None, max_features='auto', max_leaf_nodes=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=2,
                 oob_score=False, random_state=None, verbose=0,
                 warm_start=False))
    

    You can then access the model step directly:

    pipeline.steps[1][1].feature_importances_
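
    A minimal end-to-end sketch of the same idea (the step names, the StandardScaler, and the toy data here are assumptions for illustration, not from the original question):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Toy data and a two-step pipeline, assumed purely for illustration
    X, y = make_classification(n_samples=100, n_features=5, random_state=0)
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("predictor", RandomForestClassifier(n_estimators=50, random_state=0)),
    ])
    pipeline.fit(X, y)

    # Two equivalent ways to reach the fitted classifier inside the pipeline
    print(pipeline.steps[1][1].feature_importances_)
    print(pipeline.named_steps["predictor"].feature_importances_)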

  • 2020-12-31 11:21

    I wrote an article on doing this in general, which you can find here.

    In general, for a pipeline you can access the named_steps attribute. This gives you each named step in the pipeline. So, for example, for this pipeline:

    model = Pipeline([
        ("vectorizer", CountVectorizer()),
        ("transformer", TfidfTransformer()),
        ("classifier", classifier),
    ])
    

    we could access the individual feature steps by calling model.named_steps["vectorizer"].get_feature_names(). This returns the list of feature names from the CountVectorizer, the step that actually builds the vocabulary (TfidfTransformer itself doesn't expose get_feature_names), as shown in the sketch below.
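
    A runnable sketch of that access pattern (the toy corpus and the LogisticRegression classifier are assumptions for illustration; on scikit-learn >= 1.0 use get_feature_names_out instead):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # Toy corpus and classifier, assumed purely for illustration
    docs = ["the best movie", "the worst movie ever"]
    labels = [1, 0]

    model = Pipeline([
        ("vectorizer", CountVectorizer()),
        ("transformer", TfidfTransformer()),
        ("classifier", LogisticRegression()),
    ])
    model.fit(docs, labels)

    # The vocabulary lives on the CountVectorizer step
    print(model.named_steps["vectorizer"].get_feature_names())

    This is all fine and good, but it doesn't really cover many use cases, since we normally want to combine a few features. Take this model for example: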

    model = Pipeline([
        ("union", FeatureUnion(transformer_list=[
            ("h1", TfidfVectorizer(vocabulary={"worst": 0})),
            ("h2", TfidfVectorizer(vocabulary={"best": 0})),
            ("h3", TfidfVectorizer(vocabulary={"awful": 0})),
            ("tfidf_cls", Pipeline([
                ("vectorizer", CountVectorizer()),
                ("transformer", TfidfTransformer()),
            ])),
        ])),
        ("classifier", classifier),
    ])
    

    Here we combine a few features using a feature union and a subpipeline. To access these features, we'd need to explicitly call each named step in order. For example, to get the TF-IDF feature names from the internal pipeline (they live on its CountVectorizer), we'd have to do:

    model.named_steps["union"].transformer_list[3][1].named_steps["vectorizer"].get_feature_names()
    

    That's kind of a headache, but it's doable. Usually I use a variation of the following snippet. The code below treats nested pipelines and feature unions as a tree and performs a DFS, combining the feature names as it goes.

    from typing import List

    from sklearn.pipeline import FeatureUnion, Pipeline

    def get_feature_names(model, names: List[str], name: str) -> List[str]:
        """This method extracts the feature names in order from a Sklearn Pipeline.

        This method only works with composed Pipelines and FeatureUnions.  It will
        pull out all names using DFS from a model.

        Args:
            model: The model we are interested in.
            names: The list of names of final featurization steps.
            name: The current name of the step we want to evaluate.

        Returns:
            feature_names: The list of feature names extracted from the pipeline.
        """

        # Check if the name is one of our feature steps.  This is the base case.
        if name in names:
            # If the step is itself a pipeline, the names live on whichever
            # inner step exposes them (e.g. the CountVectorizer rather than
            # the TfidfTransformer).
            if isinstance(model, Pipeline):
                for _, step in reversed(model.steps):
                    if hasattr(step, "get_feature_names_out") or hasattr(
                        step, "get_feature_names"
                    ):
                        return extract_feature_names(step, name)
                return [name]
            # Otherwise get the feature names directly.
            return extract_feature_names(model, name)
        elif isinstance(model, Pipeline):
            feature_names = []
            for step_name in model.named_steps:
                feature_names += get_feature_names(
                    model.named_steps[step_name], names, step_name
                )
            return feature_names
        elif isinstance(model, FeatureUnion):
            feature_names = []
            for step_name, new_model in model.transformer_list:
                feature_names += get_feature_names(new_model, names, step_name)
            return feature_names
        # If it is none of the above, do not add it.
        else:
            return []
    

    You'll also need this method, which operates on individual transformations, things like the TfidfVectorizer, to get the names. In scikit-learn there isn't a universal get_feature_names, so you have to kind of fudge it for each different case (newer releases standardize on get_feature_names_out, which the version below also checks). This is my attempt at doing something reasonable for most use cases.

    def extract_feature_names(model, name) -> List[str]:
        """Extracts the feature names from arbitrary sklearn models.

        Args:
            model: The Sklearn model, transformer, clustering algorithm, etc.
                which we want to get named features for.
            name: The name of the current step in the pipeline we are at.

        Returns:
            The list of feature names.  If the model does not have named
            features it constructs feature names by appending an index to
            the provided name.
        """
        if hasattr(model, "get_feature_names_out"):
            # scikit-learn >= 1.0
            return list(model.get_feature_names_out())
        elif hasattr(model, "get_feature_names"):
            # older scikit-learn releases
            return model.get_feature_names()
        elif hasattr(model, "n_clusters"):
            return [f"{name}_{x}" for x in range(model.n_clusters)]
        elif hasattr(model, "n_components"):
            return [f"{name}_{x}" for x in range(model.n_components)]
        elif hasattr(model, "components_"):
            n_components = model.components_.shape[0]
            return [f"{name}_{x}" for x in range(n_components)]
        elif hasattr(model, "classes_"):
            return list(model.classes_)
        else:
            return [name]
    
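    Putting it together on the FeatureUnion model above, here's a usage sketch. The toy corpus, the names list, and the RandomForestClassifier standing in for classifier are assumptions for illustration:

    from sklearn.ensemble import RandomForestClassifier

    # Assumed stand-in for the `classifier` referenced above; it has to be
    # defined before the FeatureUnion model is constructed
    classifier = RandomForestClassifier(n_estimators=50, random_state=0)

    # Assumed toy corpus, purely for illustration
    docs = ["the best movie ever", "the worst film", "awful acting", "the best plot"]
    labels = [1, 0, 0, 1]
    model.fit(docs, labels)

    # The final featurization steps we want names for
    names = ["h1", "h2", "h3", "tfidf_cls"]
    feature_names = get_feature_names(model, names, "")

    # Pair each feature name with the forest's importance score
    importances = model.named_steps["classifier"].feature_importances_
    for fname, score in zip(feature_names, importances):
        print(f"{fname}: {score:.3f}")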