Order between using validation, training and test sets

Submitted by 六眼飞鱼酱① on 2019-11-27 09:38:26

What Wikipedia means is actually your first approach.

1 Split data into training set, validation set and test set

2 Use the training set to fit the model (find the best parameters: coefficients of the polynomial).

That just means that you use your training data to fit a model.

3 Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")

That means that you use your validation dataset to predict its values with the previously (on the training set) trained model to get a score of how good your model performs on unseen data.

You repeat steps 2 and 3 for every hyperparameter combination you want to try (in your case, the different polynomial degrees) to get a score (e.g. accuracy) for each combination.

Finally, use the test set to score the model fitted with the training set.
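The four steps above can be sketched in a few lines. This is a minimal, hypothetical example: the synthetic data, the 60/20/20 split, and the use of `numpy.polyfit` as the "model" are all illustrative choices, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic toy data: a noisy cubic relationship (illustrative only)
x = rng.uniform(-3, 3, 300)
y = x**3 - 2 * x + rng.normal(0, 3, 300)

# Step 1: split into training, validation and test sets (here 60/20/20)
idx = rng.permutation(len(x))
tr, va, te = idx[:180], idx[180:240], idx[240:]

def mse(x_eval, y_eval, coeffs):
    """Mean squared error of a fitted polynomial on held-out data."""
    return np.mean((np.polyval(coeffs, x_eval) - y_eval) ** 2)

best_degree, best_err, best_fit = None, np.inf, None
for degree in range(1, 9):                      # hyperparameter grid
    coeffs = np.polyfit(x[tr], y[tr], degree)   # step 2: fit on the training set
    err = mse(x[va], y[va], coeffs)             # step 3: score on the validation set
    if err < best_err:
        best_degree, best_err, best_fit = degree, err, coeffs

# Step 4: one single, final assessment on the test set
test_err = mse(x[te], y[te], best_fit)
print(best_degree, test_err)
```

Note that the test set enters exactly once, after the degree has already been chosen; only the validation error is used for the choice.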

Why you need the validation set is pretty well explained in this stackexchange question https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set


In the end you can use any of your three approaches.

  1st approach:

    is the fastest, because you train only one model per hyperparameter combination. It also needs less data than the other two.

  2nd approach:

    is the slowest, because for every hyperparameter combination you train k classifiers (one per fold) plus a final one on all your training data.

    It also needs a lot of data, because you split your data three times and then split the first part again into k folds.

    But it gives you the least variance in your results. It's pretty unlikely to get k good classifiers and a good validation result by coincidence, which could happen more easily in the first approach. Cross-validation is also far less likely to overfit.

  3rd approach:

    sits between the other two in its pros and cons. It is also less prone to overfitting than the first approach.
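The cross-validation route (the 2nd approach) can be sketched as follows. This is a hand-rolled k-fold loop on synthetic data, purely for illustration; in practice a library routine would be used instead.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 250)
y = x**3 - 2 * x + rng.normal(0, 3, 250)

# Hold the test set out first; cross-validation only ever sees the rest
idx = rng.permutation(len(x))
test_idx, train_idx = idx[:50], idx[50:]

def cv_score(x_tr, y_tr, degree, k=5):
    """Mean validation MSE of a polynomial of a given degree over k folds."""
    folds = np.array_split(np.arange(len(x_tr)), k)
    errs = []
    for i in range(k):
        val = folds[i]                                            # held-out fold
        fit = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x_tr[fit], y_tr[fit], degree)
        errs.append(np.mean((np.polyval(coeffs, x_tr[val]) - y_tr[val]) ** 2))
    return np.mean(errs)

# Score every hyperparameter value (polynomial degree) by its CV error
scores = {d: cv_score(x[train_idx], y[train_idx], d) for d in range(1, 9)}
best_degree = min(scores, key=scores.get)

# Refit on all training data with the chosen degree, then score once on the test set
final = np.polyfit(x[train_idx], y[train_idx], best_degree)
test_err = np.mean((np.polyval(final, x[test_idx]) - y[test_idx]) ** 2)
```

This makes the cost argument concrete: each candidate degree trains k + 1 models instead of one.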

In the end it will depend on how much data you have, whether you move to more complex models like neural networks, and how much time and computational power you have and are willing to spend.

The Wikipedia article is not wrong; in my own experience, this is a frequent point of confusion among newcomers to ML.

There are two separate ways of approaching the problem:

  • Either you use an explicit validation set to do hyperparameter search & tuning
  • Or you use cross-validation

So, the standard point is that you always put aside a portion of your data as test set; this is used for no other reason than assessing the performance of your model in the end (i.e. not back-and-forth and multiple assessments, because in that case you are using your test set as a validation set, which is bad practice).

After you have done that, you choose if you will cut another portion of your remaining data to use as a separate validation set, or if you will proceed with cross-validation (in which case, no separate and fixed validation set is required).

So, essentially, both your first and third approaches are valid (and mutually exclusive, i.e. you should choose which one you will go with). The second one, as you describe it (CV only in the validation set?), is certainly not (as said, when you choose to go with CV you don't assign a separate validation set). Apart from a brief mention of cross-validation, what the Wikipedia article actually describes is your first approach.
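In scikit-learn terms, the cross-validation variant of this workflow is what `GridSearchCV` does. The following is a minimal sketch assuming scikit-learn is available; the synthetic data, pipeline, and parameter grid are illustrative placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (300, 1))
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(0, 3, 300)

# Put the test set aside first; it is touched exactly once, at the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Polynomial regression with the degree as a tunable hyperparameter
model = Pipeline([("poly", PolynomialFeatures()), ("reg", LinearRegression())])

# 5-fold CV over the training data only; no separate validation set needed
search = GridSearchCV(model, {"poly__degree": range(1, 9)}, cv=5)
search.fit(X_train, y_train)

# Single final assessment on the test set (R^2 score)
final_score = search.score(X_test, y_test)
```

By default `GridSearchCV` refits the best configuration on all of `X_train`, so the final `score` call is exactly the one-time test-set assessment described above.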

Questions of which approach is "better" cannot of course be answered at that level of generality; both approaches are indeed valid, and are used depending on the circumstances. Very loosely speaking, I would say that in most "traditional" (i.e. non deep learning) ML settings, most people choose to go with cross-validation; but there are cases where this is not practical (most deep learning settings, again loosely speaking), and people are going with a separate validation set instead.
