How do I SelectKBest using mutual information from a mixture of discrete and continuous features?

青春惊慌失措 2021-01-20 01:11

I am using scikit learn to train a classification model. I have both discrete and continuous features in my training data. I want to do feature selection using maximum mutua

2 answers
  • 2021-01-20 01:45

    Unfortunately, I could not find this functionality in SelectKBest. What we can do easily, though, is subclass SelectKBest and override its fit() method so that it forwards an extra argument to the scoring function.

    This is the current fit() method of SelectKBest (taken from the source on GitHub):

    # No provision for extra parameters here
    def fit(self, X, y):
        X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
    
        ....
        ....
    
        # Here only the X, y are passed to scoring function
        score_func_ret = self.score_func(X, y)
    
        ....        
        ....
    
        self.scores_ = np.asarray(self.scores_)
    
        return self
    

    Now we will define our new class SelectKBestCustom with the changed fit(). I have copied everything from the above source, changing only two lines (commented about it):

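    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.utils import check_X_y
    
    class SelectKBestCustom(SelectKBest):
    
        # Changed here: fit() accepts an extra discrete_features argument
        def fit(self, X, y, discrete_features='auto'):
            X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
    
            if not callable(self.score_func):
                raise TypeError("The score function should be a callable, %s (%s) "
                                "was passed."
                                % (self.score_func, type(self.score_func)))
    
            # Private helper inherited from SelectKBest; may change between versions
            self._check_params(X, y)
    
            # Changed here also: forward discrete_features to the score function.
            # Passed by keyword because recent scikit-learn versions make it
            # keyword-only in mutual_info_classif.
            score_func_ret = self.score_func(X, y, discrete_features=discrete_features)
            if isinstance(score_func_ret, (list, tuple)):
                self.scores_, self.pvalues_ = score_func_ret
                self.pvalues_ = np.asarray(self.pvalues_)
            else:
                self.scores_ = score_func_ret
                self.pvalues_ = None
    
            self.scores_ = np.asarray(self.scores_)
            return self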
    

    This can be called simply like:

    clf = SelectKBestCustom(mutual_info_classif,k=2)
    clf.fit(X, y, discrete_features=[0, 1, 2])
    

    Edit: The above solution can be useful in pipelines also, and the discrete_features parameter can be assigned different values when calling fit().
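As a rough illustration of the pipeline use mentioned above, here is a minimal sketch. It assumes scikit-learn's default (non-routing) behaviour, where `Pipeline.fit` forwards `step__param` keyword arguments to that step's `fit()`. The class `SelectKBestWithDiscrete`, the step names `select` and `clf`, and the synthetic data are all hypothetical. Note that this sketch swaps in a different technique from the class above: rather than copying the body of `fit()`, it temporarily binds the keyword with `functools.partial` and delegates to the stock `fit()`.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class SelectKBestWithDiscrete(SelectKBest):
    """Hypothetical variant that forwards discrete_features to score_func."""

    def fit(self, X, y, discrete_features="auto"):
        original = self.score_func
        # Temporarily bind the keyword so the stock fit() can call score_func(X, y).
        self.score_func = partial(original, discrete_features=discrete_features)
        try:
            return super().fit(X, y)
        finally:
            # Restore the constructor argument so get_params()/clone() see it unchanged.
            self.score_func = original


# Synthetic data (an assumption for illustration): two discrete, two continuous columns.
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    np.where(rng.random(n) < 0.9, y, 1 - y),  # discrete, informative
    rng.integers(0, 3, size=n),               # discrete, noise
    2.0 * y + rng.normal(scale=0.5, size=n),  # continuous, informative
    rng.normal(size=n),                       # continuous, noise
])

pipe = Pipeline([
    ("select", SelectKBestWithDiscrete(mutual_info_classif, k=2)),
    ("clf", LogisticRegression()),
])

# "select__discrete_features" is routed to the fit() of the "select" step.
pipe.fit(X, y, select__discrete_features=[0, 1])
selected = pipe.named_steps["select"].get_support(indices=True)
predictions = pipe.predict(X)
```

Because `discrete_features` is a fit-time argument here, the same pipeline object can be refit with a different set of discrete columns without rebuilding the selector.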

    Another solution (less preferable): if you only need SelectKBest to work with mutual_info_classif temporarily (for example, just to analyse the results), you can write a custom function that calls mutual_info_classif internally with a hard-coded discrete_features. Something along the lines of:

    def mutual_info_classif_custom(X, y):
        # To change discrete_features, you need to redefine the function each time,
        # because once the function is supplied to SelectKBest it can't be changed
        discrete_features = [0, 1, 2]
    
        # Passed by keyword: discrete_features is keyword-only in recent scikit-learn
        return mutual_info_classif(X, y, discrete_features=discrete_features)
    

    Usage of the above function:

    selector = SelectKBest(mutual_info_classif_custom).fit(X, y)
    
  • You could also use partials as follows:

    from functools import partial
    
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    
    discrete_mutual_info_classif = partial(mutual_info_classif, discrete_features=[0, 1, 2])
    SelectKBest(score_func=discrete_mutual_info_classif).fit(X, y)
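To see the partial-based approach end to end, here is a small self-contained sketch on synthetic data. The data, the column layout (columns 0 and 1 discrete, columns 2 and 3 continuous) and the choice of `k=2` are assumptions made for illustration only; `random_state` is bound as well so that the mutual information estimate is repeatable.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data (hypothetical): one informative and one noise column of each kind.
rng = np.random.default_rng(42)
n = 500
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    np.where(rng.random(n) < 0.9, y, 1 - y),  # discrete, agrees with y ~90% of the time
    rng.integers(0, 3, size=n),               # discrete, unrelated to y
    2.0 * y + rng.normal(scale=0.5, size=n),  # continuous, shifted by the class label
    rng.normal(size=n),                       # continuous, unrelated to y
])

# Bind the extra keyword arguments once; SelectKBest only ever calls score_func(X, y).
score_func = partial(mutual_info_classif, discrete_features=[0, 1], random_state=0)

selector = SelectKBest(score_func=score_func, k=2).fit(X, y)
selected = selector.get_support(indices=True)  # indices of the k best-scoring columns
X_reduced = selector.transform(X)              # keeps only the selected columns
```

The per-feature mutual information estimates are available afterwards in `selector.scores_`, which can help check that the discrete columns were treated as intended.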
    