How can I use a custom feature selection function in scikit-learn's `pipeline`?

闹比i 2021-01-30 18:47

Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n>2 features via cross-validation and by u

5 answers
  •  北荒 (original poster)
    2021-01-30 19:22

    I am posting my solution for completeness; it may be useful to others:

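    from sklearn.base import BaseEstimator, TransformerMixin

    # Inheriting from BaseEstimator provides get_params/set_params, which
    # cross-validation utilities need in order to clone the pipeline.
    class ColumnExtractor(BaseEstimator, TransformerMixin):

        def fit(self, X, y=None):
            # Nothing is learned; fit exists only to satisfy the Pipeline API.
            return self

        def transform(self, X):
            # Slice 2:4 picks column indices 2 and 3, i.e. the third and fourth columns.
            return X[:, 2:4]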
    

    Then, it can be used in the Pipeline like so:

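    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    clf = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('reduce_dim', ColumnExtractor()),
        ('classification', GaussianNB()),
    ])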
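
Since the question is about comparing approaches via cross-validation, here is a minimal sketch of how such a pipeline might be scored. The Iris dataset and the 5-fold setting are my own assumptions for illustration, not part of the original answer; the extractor inherits from `BaseEstimator` so that `cross_val_score` can clone the pipeline for each fold.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnExtractor(BaseEstimator, TransformerMixin):
    """Keep the third and fourth columns (indices 2 and 3)."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, X):
        return X[:, 2:4]


# Assumption: Iris (150 samples, 4 features) stands in for the real dataset.
X, y = load_iris(return_X_y=True)

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', ColumnExtractor()),
    ('classification', GaussianNB()),
])

# The pipeline is cloned and refit on each of the 5 training folds,
# so scaling and column extraction never see the held-out fold.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Swapping `ColumnExtractor()` for another transformer (for example `PCA(n_components=2)`) in the `reduce_dim` step and comparing the mean scores is one way to compare the approaches.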
    

    EDIT: General solution

    For a more general solution that selects and stacks multiple columns, you can use the following class:

    from sklearn.base import BaseEstimator, TransformerMixin

    class ColumnExtractor(BaseEstimator, TransformerMixin):

        def __init__(self, cols):
            # Column indices to keep, e.g. (1, 3).
            self.cols = cols

        def fit(self, X, y=None):
            # Stateless: nothing is learned from the data.
            return self

        def transform(self, X):
            # Integer-array indexing selects the columns and stacks them in the given order.
            return X[:, list(self.cols)]

    clf = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('dim_red', ColumnExtractor(cols=(1, 3))),  # selects the second and fourth columns
        ('classification', GaussianNB()),
    ])
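
As a possible alternative to writing a class, scikit-learn's built-in `FunctionTransformer` can wrap a plain selection function. This is not the approach from the answer above, only a sketch of an equivalent; the helper name `select_columns` and the synthetic data are hypothetical and used purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def select_columns(X, cols):
    """Hypothetical helper: return only the columns listed in `cols`."""
    return X[:, list(cols)]


# Assumption: small synthetic data, 20 samples with 4 features and 2 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = np.array([0, 1] * 10)

# kw_args forwards the column indices to select_columns on every call.
selector = FunctionTransformer(select_columns, kw_args={'cols': (1, 3)})

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('dim_red', selector),  # keeps the second and fourth columns
    ('classification', GaussianNB()),
])
clf.fit(X, y)
predictions = clf.predict(X)
```

The custom class remains the better fit when the selected columns should be tuned as a hyperparameter, since `cols` is then exposed directly through `get_params`.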
    
