Question
I have split my data into train/test sets before doing cross-validation on the training data to validate my hyperparameters. I have an unbalanced dataset and want to perform SMOTE oversampling on each CV iteration, so I have set up a pipeline using imblearn.

My understanding is that oversampling should be done after dividing the data into k folds to prevent information leakage. Is this order of operations (split the data into k folds, oversample the k-1 training folds, predict on the remaining fold) preserved when using Pipeline in the setup below?
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from scipy import stats
import numpy as np
import xgboost as xgb

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', xgb.XGBClassifier())
])

param_dist = {'classification__n_estimators': stats.randint(50, 500),
              'classification__learning_rate': stats.uniform(0.01, 0.3),
              'classification__subsample': stats.uniform(0.3, 0.6),
              'classification__max_depth': [3, 4, 5, 6, 7, 8, 9],
              'classification__colsample_bytree': stats.uniform(0.5, 0.5),
              'classification__min_child_weight': [1, 2, 3, 4],
              # 'ratio' was deprecated in imblearn 0.4 and later removed;
              # current releases use 'sampling_strategy' instead
              'sampling__sampling_strategy': np.linspace(0.25, 0.5, 10)}

random_search = RandomizedSearchCV(model,
                                   param_dist,
                                   cv=StratifiedKFold(n_splits=5),
                                   n_iter=10,
                                   scoring=scorer_cv_cost_savings)  # custom scorer defined elsewhere
random_search.fit(X_train.values, y_train)
Answer 1:
Your understanding is right. When you pass the pipeline as model to RandomizedSearchCV, each cross-validation iteration calls .fit() on the k-1 training folds and evaluates on the remaining fold. Because SMOTE is a step inside the pipeline, the oversampling happens during that per-fold .fit(), so only the k-1 training folds are resampled and the held-out fold is never touched by the sampler.
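To see this concretely, here is a minimal sketch that walks the folds by hand (the toy data and LogisticRegression are illustrative stand-ins, not from the question): SMOTE fires inside fit() on the k-1 training folds, while predict() on the held-out fold bypasses the sampler entirely.

import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = np.array([0] * 180 + [1] * 20)  # 9:1 imbalance, stand-in for the real data

pipe = Pipeline([('sampling', SMOTE(random_state=0)),
                 ('classification', LogisticRegression())])

for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    pipe.fit(X[train_idx], y[train_idx])  # SMOTE runs here, on the k-1 training folds only
    pipe.predict(X[test_idx])             # held-out fold is classified as-is, never resampled
    # Re-running the sampler step alone shows what the classifier saw after oversampling:
    _, y_res = pipe.named_steps['sampling'].fit_resample(X[train_idx], y[train_idx])
    print(Counter(y[train_idx]), '->', Counter(y_res))

Each printed line shows the imbalanced fold going in and the balanced fold the classifier was actually fit on.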
The documentation for imblearn.pipeline.Pipeline.fit() says:
Fit the model
Fit all the transforms/samplers one after the other and transform/sample the data, then fit the transformed/sampled data using the final estimator.
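This ordering also matters for the scores you report. As a hedged illustration (synthetic data from make_classification and LogisticRegression as a stand-in classifier), oversampling the whole training set before cross-validation lets synthetic near-copies of a sample land in both the training and validation folds, which typically inflates the CV score compared with resampling inside the pipeline:

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=5)

# Leaky: resample first, then cross-validate on data containing synthetic twins
X_bad, y_bad = SMOTE(random_state=0).fit_resample(X, y)
leaky = cross_val_score(LogisticRegression(), X_bad, y_bad, cv=cv, scoring='recall')

# Correct: resample inside the pipeline, so only the training folds are oversampled
pipe = Pipeline([('sampling', SMOTE(random_state=0)),
                 ('classification', LogisticRegression())])
clean = cross_val_score(pipe, X, y, cv=cv, scoring='recall')

print('leaky CV recall:    ', leaky.mean())
print('fold-wise CV recall:', clean.mean())

The fold-wise number is the honest estimate of generalization; the leaky one is measured partly on resampled data.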
Source: https://stackoverflow.com/questions/56011706/does-oversampling-happen-before-or-after-cross-validation-using-imblearn-pipelin