How can I use a custom feature selection function in scikit-learn's `pipeline`?

闹比i 2021-01-30 18:47

Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n>2 features via cross-validation and by u

5 answers
  • 2021-01-30 19:19

    I didn't find the accepted answer very clear, so here is my solution for others. The idea is to create a new class that inherits from BaseEstimator and TransformerMixin.

    The following is a feature selector based on the share of NAs within a column. The perc value is the threshold, given as a fraction (0.1 means 10%): columns with a smaller fraction of NAs are kept.

    from sklearn.base import TransformerMixin, BaseEstimator

    class NonNAselector(BaseEstimator, TransformerMixin):
        """Keep columns whose fraction of NA values is below `perc`,
        so that they can be imputed later in the pipeline.

        fit       : identify the columns to keep, using the training set
        transform : return only those columns
        """

        def __init__(self, perc=0.1):
            self.perc = perc
            self.columns_with_less_than_x_na_id = None

        def fit(self, X, y=None):
            # X is expected to be a pandas DataFrame
            na_fraction = X.isna().sum() / X.shape[0]
            # Apply the mask before reading the index; calling .index on the
            # boolean Series itself would return every column
            self.columns_with_less_than_x_na_id = na_fraction[na_fraction < self.perc].index.tolist()
            return self

        def transform(self, X, y=None, **kwargs):
            return X[self.columns_with_less_than_x_na_id]

        # Optional: BaseEstimator already provides an equivalent get_params
        def get_params(self, deep=False):
            return {"perc": self.perc}
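
    To see what the fit step is meant to compute, here is a minimal sketch on a made-up DataFrame (the column names, values, and the 0.3 threshold are assumptions chosen for illustration). It also shows why the boolean mask has to be applied before the index is read.

```python
import numpy as np
import pandas as pd

# Made-up data: "a" has no NA, "b" is 50% NA, "c" is 25% NA
X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [np.nan, 1.0, 2.0, 3.0],
})

perc = 0.3  # keep columns with less than 30% NA

# Fraction of NA values per column
na_fraction = X.isna().mean()

# Filter with the boolean mask first, then read the surviving column names
kept = na_fraction[na_fraction < perc].index.tolist()

# Without filtering, .index simply lists every column of the Series
all_cols = (na_fraction < perc).index.tolist()

print(kept)
print(all_cols)
```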
    
  • 2021-01-30 19:22

    I'm posting my solution for completeness; maybe it is useful to someone:

    class ColumnExtractor(object):
    
        def transform(self, X):
            cols = X[:,2:4] # column 3 and 4 are "extracted"
            return cols
    
        def fit(self, X, y=None):
            return self
    

    Then, it can be used in the Pipeline like so:

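    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.naive_bayes import GaussianNB

    clf = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('reduce_dim', ColumnExtractor()),
        ('classification', GaussianNB())
    ])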
    

    EDIT: General solution

    For a more general solution, if you want to select and stack multiple columns, you can use the following class:

    import numpy as np
    
    class ColumnExtractor(object):
    
        def __init__(self, cols):
            self.cols = cols
    
        def transform(self, X):
            col_list = []
            for c in self.cols:
                col_list.append(X[:, c:c+1])
            return np.concatenate(col_list, axis=1)
    
        def fit(self, X, y=None):
            return self
    
    clf = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('dim_red', ColumnExtractor(cols=(1,3))),   # selects the second and fourth columns
        ('classification', GaussianNB())
    ])
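
    Since the question is about comparing approaches via cross-validation, here is a self-contained sketch of how such a pipeline might be scored. The iris dataset, the 5-fold setting, and the `ColumnSelector` name are assumptions made for illustration; the helper inherits from BaseEstimator and TransformerMixin so that scikit-learn can clone it for each fold.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep the columns listed in `cols`."""

    def __init__(self, cols=(0, 1)):
        self.cols = cols

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        # Index with a list so the result stays two-dimensional
        return np.asarray(X)[:, list(self.cols)]


X, y = load_iris(return_X_y=True)

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("dim_red", ColumnSelector(cols=(1, 3))),  # second and fourth columns
    ("classification", GaussianNB()),
])

# One accuracy value per fold
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

    Swapping the "dim_red" step for another transformer (PCA, for instance) and comparing the mean scores is one way to run the comparison described in the question.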
    
  • 2021-01-30 19:26

    You can use the following custom transformer to select the columns specified:

    from sklearn.base import BaseEstimator, TransformerMixin

    # Custom transformer that extracts the columns passed to its constructor
    class FeatureSelector(BaseEstimator, TransformerMixin):

        # Store the argument under its own name so that the get_params
        # inherited from BaseEstimator (used when cloning) can find it
        def __init__(self, feature_names):
            self.feature_names = feature_names

        # Nothing to learn, so just return self
        def fit(self, X, y=None):
            return self

        # Select the requested columns (X is expected to be a pandas DataFrame)
        def transform(self, X, y=None):
            return X[self.feature_names]
    

    Here feature_names is the list of features (column names) that you want to select. For more details, see https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65

  • 2021-01-30 19:42

    If you want to use the Pipeline object, then yes, the clean way is to write a transformer object. The dirty way to do this is

    select_3_and_4.transform = select_3_and_4.__call__
    # Pipeline calls fit(X, y), so the stub has to accept both arguments
    select_3_and_4.fit = lambda X, y=None: select_3_and_4
    

    and use select_3_and_4 as you had it in your pipeline. You can evidently also write a class.
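
    A less hacky way to plug a plain function into a pipeline is scikit-learn's FunctionTransformer, which wraps a callable in a transformer. The sketch below assumes that select_3_and_4 takes a 2-D NumPy array and returns its third and fourth columns (the body of the function is not shown in the question, so this is a guess), and it uses random made-up data only to show that the pipeline runs.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def select_3_and_4(X):
    """Assumed behaviour: return the third and fourth columns."""
    return X[:, 2:4]


clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("reduce_dim", FunctionTransformer(select_3_and_4)),
    ("classification", GaussianNB()),
])

# Made-up data: 20 samples, 5 features, two alternating classes
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.array([0, 1] * 10)

clf.fit(X, y)
pred = clf.predict(X)
```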

    Otherwise, you could also just give X_train[:, 2:4] to your pipeline if you know that the other features are irrelevant.

    Data-driven feature selection tools may be off-topic here, but they are always useful: see, e.g., sklearn.feature_selection.SelectKBest with sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression as the scoring function, with e.g. k=2 in your case.
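
    As a rough sketch of that suggestion, SelectKBest can take the place of the hand-written selection step. The iris dataset and the step names are assumptions for illustration; k=2 follows the value mentioned above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    # Keep the two features with the highest ANOVA F-score
    ("reduce_dim", SelectKBest(score_func=f_classif, k=2)),
    ("classification", GaussianNB()),
])
clf.fit(X, y)

# Boolean mask over the input columns: True where a column was kept
mask = clf.named_steps["reduce_dim"].get_support()
print(mask)
```

    Because the selector sits inside the pipeline, it is refitted on each training fold during cross-validation, so the held-out fold does not influence which features are chosen.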

  • 2021-01-30 19:42

    Adding to Sebastian Raschka's and eickenberg's answers, the requirements a transformer object should meet are specified in scikit-learn's documentation.

    There are several more requirements than just having fit and transform if you want the estimator to be usable in parameter estimation, such as implementing set_params.
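
    One way to satisfy those extra requirements without writing get_params and set_params by hand is to inherit from BaseEstimator, which derives both from the constructor signature. The sketch below is an illustration under assumptions: `ColumnWindow` and its `start` and `width` parameters are hypothetical names, and the iris dataset is used only as sample data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline


class ColumnWindow(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep `width` adjacent columns from `start`."""

    def __init__(self, start=0, width=2):
        # Attributes carry the same names as the constructor arguments,
        # which is what the inherited get_params / set_params rely on
        self.start = start
        self.width = width

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.start:self.start + self.width]


X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[
    ("select", ColumnWindow()),
    ("classification", GaussianNB()),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {"select__start": [0, 1, 2]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
best_start = search.best_params_["select__start"]
```

    The grid search works here because it clones the pipeline and calls set_params on it for every candidate; a class that only defines fit and transform would fail at that point.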
