Custom sklearn pipeline transformer giving “pickle.PicklingError”

Question

I am trying to create a custom transformer for a Python sklearn pipeline based on guidance from this tutorial: http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/

Right now my custom class/transformer looks like this:

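import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor


class SelectBestPercFeats(BaseEstimator, TransformerMixin):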
    def __init__(self, model=RandomForestRegressor(), percent=0.8,
                 random_state=52):
        self.model = model
        self.percent = percent
        self.random_state = random_state


    def fit(self, X, y, **fit_params):
        """
        Find features with best predictive power for the model, and
        have cumulative importance value less than self.percent
        """
        # Check parameters
        if not isinstance(self.percent, float):
            print("SelectBestPercFeats.percent is not a float, it should be...")
        elif not isinstance(self.random_state, int):
            print("SelectBestPercFeats.random_state is not a int, it should be...")

        # If checks are good proceed with fitting...
        else:
            try:
                self.model.fit(X, y)
            except Exception:
                print("Error fitting model inside SelectBestPercFeats object")
                return self

            # Get feature importance
            try:
                feat_imp = list(self.model.feature_importances_)
                feat_imp_cum = pd.Series(feat_imp, index=X.columns) \
                    .sort_values(ascending=False).cumsum()

                # Get features whose cumulative importance is <= `percent`
                n_feats = len(feat_imp_cum[feat_imp_cum <= self.percent].index) + 1
                self.bestcolumns_ = list(feat_imp_cum.index)[:n_feats]
            except Exception:
                print("ERROR: SelectBestPercFeats can only be used with models with"
                      " a .feature_importances_ attribute")
        return self


    def transform(self, X, y=None, **fit_params):
        """
        Filter out only the important features (based on percent threshold)
        for the model supplied.

        :param X: Dataframe with features to be down selected
        """
        # bestcolumns_ only exists after a successful call to fit
        if getattr(self, "bestcolumns_", None) is None:
            print("Must call fit function on SelectBestPercFeats object before transforming")
        else:
            return X[self.bestcolumns_]

I am integrating this class into an sklearn pipeline like this:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Define feature selection and model pipeline components
rf_simp = RandomForestRegressor(criterion='mse', n_jobs=-1,
                                n_estimators=600)
bestfeat = SelectBestPercFeats(rf_simp, feat_perc)
rf = RandomForestRegressor(n_jobs=-1,
                           criterion='mse',
                           n_estimators=200,
                           max_features=0.4,
                           )

# Build Pipeline
master_model = Pipeline([('feat_sel', bestfeat), ('rf', rf)])

# Define GridSearchCV parameter space to search,
#   only listing one parameter to simplify troubleshooting.
#   The prefix must match the pipeline step name ('feat_sel').
param_grid = {
    'feat_sel__percent': [0.8],
}

# Fit pipeline model
grid = GridSearchCV(master_model, cv=3, n_jobs=-1,
                    param_grid=param_grid)

# Search grid using CV, and get the best estimator
grid.fit(X_train, y_train)

Whenever I run the last line of code (grid.fit(X_train, y_train)) I get the following "PicklingError". Can anyone see what is causing this problem in my code?

EDIT:

Or is something wrong with my Python setup? Might I be missing a package or something similar? I just checked that I can import pickle successfully.

Traceback (most recent call last):
  File "", line 5, in
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\model_selection_search.py", line 945, in fit
    return self._fit(X, y, groups, ParameterGrid(self.param_grid))
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\model_selection_search.py", line 564, in _fit
    for parameters in parameter_iterable
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 768, in call
    self.retrieve()
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 719, in retrieve
    raise exception
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 682, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 608, in get
    raise self._value
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 385, in _handle_tasks
    put(task)
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\pool.py", line 371, in send
    CustomizablePickler(buffer, self._reducers).dump(obj)
_pickle.PicklingError: Can't pickle : attribute lookup SelectBestPercFeats on builtins failed

Answer 1:

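pickle serializes a class by reference (module name plus class name), so it must be able to find the class by importing its module. GridSearchCV with n_jobs=-1 pickles the pipeline to send it to worker processes, and a class typed into an interactive console may not be findable that way. So define the custom class in a separate module: create a Python file (e.g. transformation.py) containing the class, then import it with from transformation import SelectBestPercFeats. That will resolve the pickling error.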
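
A minimal sketch of why the module matters, using only the standard library. The stripped-down SelectBestPercFeats here is a stand-in with no sklearn logic, and writing transformation.py into a temporary directory is only a trick to keep the example self-contained; in a real project the file would simply sit next to your script.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Stand-in for the real transformer (assumption: no sklearn logic needed
# to show the pickling behaviour).
MODULE_SOURCE = '''
class SelectBestPercFeats:
    def __init__(self, percent=0.8):
        self.percent = percent
'''


def load_transformer_class(directory):
    """Write transformation.py into `directory` and import the class from it."""
    Path(directory, "transformation.py").write_text(MODULE_SOURCE)
    sys.path.insert(0, str(directory))
    importlib.invalidate_caches()
    module = importlib.import_module("transformation")
    return module.SelectBestPercFeats


tmp_dir = tempfile.mkdtemp()
SelectBestPercFeats = load_transformer_class(tmp_dir)

# pickle does not store the class body, only a reference to
# transformation.SelectBestPercFeats plus the instance attributes.
payload = pickle.dumps(SelectBestPercFeats(percent=0.5))
restored = pickle.loads(payload)

print(type(restored).__module__)  # transformation
print(restored.percent)           # 0.5
```

In the actual project, the class body from the question would go into transformation.py unchanged, and the script that builds the pipeline would import it from there.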

Answer 2:

I had the same problem, but in my case the cause was function transformers: the standard pickle module cannot serialize some functions, such as lambdas. The solution for me was to use dill instead, though it is a bit slower.
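
A small sketch of the function case, using only the standard library (dill itself is not used here). The helper can_pickle is a hypothetical name introduced for illustration. It shows that a lambda cannot be pickled, while a function that can be imported by name from a module can.

```python
import math
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize `obj`."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A lambda has no importable name, so pickle cannot store a reference to it.
anonymous = lambda x: math.log1p(x)

# math.log1p can be looked up as "math.log1p", so pickle stores that reference.
named = math.log1p

print(can_pickle(anonymous))  # False
print(can_pickle(named))      # True
```

If the function passed to a transformer can be replaced by one defined at module level in an importable file, the standard pickle module may be enough and dill is not needed.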

Source: https://stackoverflow.com/questions/45335524/custom-sklearn-pipeline-transformer-giving-pickle-picklingerror

Tags

python

scikit-learn

customization

pickle

pipeline