Question
I am working in scikit-learn and I am trying to tune my XGBoost model. I attempted nested cross-validation, using a pipeline to rescale the training folds (to avoid data leakage and overfitting) in combination with GridSearchCV for parameter tuning, and cross_val_score to get the roc_auc score at the end.
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

steps = [('std_scaling', StandardScaler()), ('algo', XGBClassifier())]
pipeline = Pipeline(steps)

parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}

# inner CV: hyperparameter search within each training fold of the outer CV
cv_inner = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
clf_auc = GridSearchCV(pipeline, cv=cv_inner, param_grid=parameters,
                       scoring='roc_auc', n_jobs=-1, return_train_score=False)

# outer CV: score the whole selection process
cv_outer = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv=cv_outer, scoring='roc_auc')
Question 1. How do I fit cross_val_score to the training data?

Question 2. Since I included the StandardScaler() in the pipeline, does it make sense to include X_train in cross_val_score, or should I use a standardized form of X_train (i.e. std_X_train)?
std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)
Answer 1:
You chose the right way to avoid data leakage, as you say: nested CV.
The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existent "meta-estimator" that describes your model selection process as well.
Meaning: in every round of the outer cross-validation (in your case represented by cross_val_score), the estimator clf_auc undergoes its own internal CV, which selects the best model given that fold of the external CV. Therefore, for every fold of the external CV you are scoring a different estimator, chosen by the internal CV.
For example, in one external CV fold the scored model may be one where the internal search selected algo__min_child_weight = 1, and in another fold one where it selected 2. You can see this directly by keeping the fitted estimator of every outer fold, as in the sketch below.
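A minimal sketch of that inspection (assuming X_train, y_train and the clf_auc/cv_outer objects from the question; cross_validate with return_estimator=True keeps the fitted GridSearchCV of each outer fold):

from sklearn.model_selection import cross_validate

outer_results = cross_validate(clf_auc, X_train, y_train, cv=cv_outer,
                               scoring='roc_auc', return_estimator=True)
for i, fitted_search in enumerate(outer_results['estimator']):
    # best_params_ is set by the internal GridSearchCV run of that outer fold
    print('outer fold %d: %s' % (i, fitted_search.best_params_))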
The score of the external CV therefore represents a higher-level score: "under this model selection process, how well will my selected model generalize?"
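To turn the per-fold outer scores into that single estimate, you would summarize the array returned by cross_val_score (a short sketch using outer_clf_auc from the question):

# one roc_auc per outer fold: 2 splits x 5 repeats = 10 values
print('nested CV roc_auc: %.3f +/- %.3f' % (outer_clf_auc.mean(), outer_clf_auc.std()))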
Now, if you want to finish the process with a real model in hand, you have to select it in some way (cross_val_score will not do that for you).
The way to do that is to fit your internal estimator over the entire data, meaning to perform:
clf_auc.fit(X, y)
This is the moment to understand what you've done here:
- You have a model you can use, which is fitted over all the data available.
- When you're asked "how well does that model generalize on new data?", the answer is the score you got during your nested CV, which captured the model selection process as part of the model's scoring.
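Concretely, a minimal sketch of this final step (using the X_train/y_train/X_test names from the question; note that GridSearchCV refits the best pipeline on all the data it is given, since refit=True by default):

clf_auc.fit(X_train, y_train)        # inner CV picks the best params, then refits on all rows passed in
print(clf_auc.best_params_)          # the hyperparameters of the model you now "hold in your hand"
test_proba = clf_auc.predict_proba(X_test)[:, 1]  # predictions come from the refitted best pipeline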
And regarding Question #2: if the scaler is part of the pipeline, there is no reason to standardize X_train externally. Pass the raw X_train; the pipeline re-fits the StandardScaler on the training portion of every CV split, so the held-out fold never leaks into the scaling statistics.
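To make the contrast concrete, a small sketch reusing the names from the question (std_X_train as defined there):

# leak-free: the scaler is fitted inside each split, on the training portion only
auc_pipeline = cross_val_score(clf_auc, X_train, y_train, cv=cv_outer, scoring='roc_auc')

# leaky: std_X_train was scaled with mean/std computed on ALL of X_train,
# so every held-out fold has already influenced the features it is scored on
auc_leaky = cross_val_score(clf_auc, std_X_train, y_train, cv=cv_outer, scoring='roc_auc')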
Source: https://stackoverflow.com/questions/52138897/fitting-in-nested-cross-validation-with-cross-val-score-with-pipeline-and-gridse