Fitting in nested cross-validation with cross_val_score with pipeline and GridSearch

You chose the right way to avoid data leakage, as you say: nested CV.

The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existent "meta-estimator" that describes your model-selection process as well.

Meaning: in every round of the outer cross-validation (in your case represented by cross_val_score), the estimator clf_auc undergoes an internal CV that selects the best model on the given fold of the external CV. Therefore, for every fold of the external CV you are scoring a different estimator, chosen by the internal CV.

For example, in one external CV fold the scored model may be one for which the internal CV selected algo__min_child_weight to be 1, and in another, a model for which it selected 2.

The score of the external CV therefore represents a higher-level quantity: "given my reasonable model-selection process, how well will the model it selects generalize?"
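As a minimal sketch of that setup, assuming clf_auc is a GridSearchCV over a pipeline whose final step is named "algo" (an XGBClassifier, as the algo__min_child_weight prefix suggests; the data and all names here are stand-ins based on the question):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Stand-in data; the question's real X, y would go here.
X, y = make_classification(n_samples=500, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("algo", XGBClassifier()),
])

# Internal CV: picks the best hyperparameters on each training split.
clf_auc = GridSearchCV(
    pipe,
    param_grid={"algo__min_child_weight": [1, 2, 5]},
    scoring="roc_auc",
    cv=3,
)

# External CV: each of the 5 outer folds refits the whole search,
# so each fold may score a model with different selected parameters.
nested_scores = cross_val_score(clf_auc, X, y, scoring="roc_auc", cv=5)
print(nested_scores.mean())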

Now, if you want to finish the process with an actual model in hand, you have to select it in some way (cross_val_score will not do that for you).

The way to do that is to fit your internal model (the grid search) on the entire dataset, meaning to perform:

clf_auc.fit(X, y)

This is the moment to understand what you've done here:

  1. You have a model you can use, fitted on all the available data.
  2. When you're asked "how well does that model generalize on new data?", the answer is the score you got during your nested CV, which captured the model-selection process as part of the model's scoring.
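Continuing the sketch above (X, y and clf_auc as assumed there), the final fit and its use would look like:

# Fit the grid search, and thus the selection process, on all the data.
clf_auc.fit(X, y)
print(clf_auc.best_params_)  # hyperparameters chosen by the internal CV
# Predicting on hypothetical, previously unseen data X_new:
# y_pred = clf_auc.predict(X_new)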

And regarding Question #2: if the scaler is part of the pipeline, there is no reason to scale X_train externally.
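A short sketch of why, continuing the assumed pipeline above: under any fit/predict split, the scaler's statistics are learned from the training portion only, so there is no leakage to handle by hand.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is a pipeline step, so fit() learns its statistics from
# X_train only; X_test is then transformed with those same statistics.
clf_auc.fit(X_train, y_train)
probs = clf_auc.predict_proba(X_test)[:, 1]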
