Put customized functions in Sklearn pipeline

礼貌的吻别 asked 2020-12-30 14:52

In my classification scheme, there are several steps including:

  1. SMOTE (Synthetic Minority Over-sampling Technique)
  2. Fisher criteria for feature selection
2 Answers
  • 2020-12-30 15:14

    I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes, you can definitely do this. You will need to write a wrapper class around those functions, though. The easiest way to do this is to inherit from sklearn's BaseEstimator and TransformerMixin classes; see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

    If this isn't making sense to you, post the details of at least one of your functions (the library it comes from or your code if you wrote it yourself) and we can go from there.
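The wrapper pattern can be sketched in a few lines. The `TopVarianceSelector` below is a hypothetical transformer (not from the question): everything it learns happens in `fit()`, and `transform()` only applies that learned state, which is the contract Pipeline expects.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TopVarianceSelector(BaseEstimator, TransformerMixin):
    """Hypothetical example: keep the k highest-variance columns."""
    def __init__(self, k=2):
        self.k = k  # constructor only stores hyperparameters

    def fit(self, X, y=None):
        # learn which columns to keep (descending variance)
        self.cols_ = np.argsort(np.var(X, axis=0))[::-1][:self.k]
        return self

    def transform(self, X):
        # apply the learned column selection
        return X[:, self.cols_]

X = np.array([[1.0, 0.0, 5.0],
              [2.0, 0.1, 1.0],
              [3.0, 0.0, 9.0]])
out = TopVarianceSelector(k=2).fit_transform(X)  # shape (3, 2)
```

Because it inherits `BaseEstimator`, its `k` parameter is visible to `GridSearchCV` (e.g. `selector__k`); `TransformerMixin` supplies `fit_transform` for free.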

    EDIT:

    I apologize, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations of your target, so you will have to apply those beforehand, as you originally were. For your reference, here is what a custom class for your Fisher process would look like; it works because this step does not need to affect your target variable.

    from numpy import argsort, ceil
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV  # sklearn.grid_search before 0.18
    from sklearn.datasets import load_iris

    class Fisher(BaseEstimator, TransformerMixin):
        def __init__(self, percentile=0.95):
            # fraction of features to keep, ranked by Fisher score
            self.percentile = percentile

        def fit(self, X, y):
            X_pos, X_neg = X[y == 1], X[y == 0]
            X_mean = X.mean(axis=0)
            X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
            deno = (1.0 / (X_pos.shape[0] - 1)) * X_pos.var(axis=0) \
                 + (1.0 / (X_neg.shape[0] - 1)) * X_neg.var(axis=0)
            num = (X_pos_mean - X_mean) ** 2 + (X_neg_mean - X_mean) ** 2
            F = num / deno
            sort_F = argsort(F)[::-1]
            n_feature = int(ceil(self.percentile * X.shape[1]))
            self.ind_feature = sort_F[:n_feature]
            return self

        def transform(self, X):
            # index columns (features), not rows: x[self.ind_feature, :]
            # would select rows and make X and y incompatible downstream
            return X[:, self.ind_feature]

    data = load_iris()

    pipeline = Pipeline([
        ('fisher', Fisher()),
        ('normal', StandardScaler()),
        ('svm', SVC(class_weight='balanced'))  # 'auto' was renamed 'balanced'
    ])

    grid = {
        'fisher__percentile': [0.75, 0.50],  # keep 75% / 50% of the features
        'svm__C': [1, 2]
    }

    model = GridSearchCV(estimator=pipeline, param_grid=grid, cv=2)
    model.fit(data.data, data.target)
    
  • 2020-12-30 15:20

    scikit-learn added FunctionTransformer to the preprocessing module in version 0.17. It can be used in a similar manner to David's implementation of the Fisher class in the answer above, but with less flexibility. If the input/output of the function is configured properly, the transformer implements the fit/transform/fit_transform methods for the function and thus allows it to be used in a scikit-learn pipeline.

    For example, if the input to a pipeline is a series, the transformer would be as follows:

    def trans_func(input_series):
        return output_series

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer
    transformer = FunctionTransformer(trans_func)

    sk_pipe = Pipeline([("trans", transformer), ("vect", tf_1k), ("clf", clf_1k)])
    sk_pipe.fit(train.desc, train.tag)

    where vect is a tf_idf transformer, clf is a classifier and train is the training dataset. "train.desc" is the series text input to the pipeline.
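A self-contained variant of the same idea, using a stateless numpy function on the iris data instead of the asker's text series (the step names here are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

X, y = load_iris(return_X_y=True)

# Wrap a plain function; FunctionTransformer supplies fit/transform around it
log_step = FunctionTransformer(np.log1p)

pipe = Pipeline([("log", log_step), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)
```

Because `np.log1p` is stateless, `fit` on the transformer is a no-op; the wrapper exists purely so the function slots into the pipeline interface.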
