Question
GridSearchCV and RandomizedSearchCV have best_estimator_, which:
- Returns only the best estimator/model
- Finds the best estimator via one of the simple scoring methods: accuracy, recall, precision, etc.
- Evaluates based on the training set only
I would like to get past those limitations with:
- My own definition of scoring methods
- Evaluation on the test set rather than the training set, as done by GridSearchCV. Ultimately it is the test-set performance that counts; the training set tends to give almost perfect accuracy in my grid search.
I was thinking of achieving this by:
- Getting the individual estimators/models from GridSearchCV and RandomizedSearchCV
- Predicting on the test set with every estimator/model and evaluating with my customized score
My questions are:
- Is there a way to get all individual models from GridSearchCV?
- If not, what are your thoughts on achieving the same thing? Initially I wanted to exploit the existing GridSearchCV because it automatically handles multiple parameter grids, CV, and multi-threading. Any other recommendation that achieves a similar result is welcome.
Answer 1:
You can already use custom scoring methods in the XYZSearchCVs: see the scoring parameter and the documentation's links to the User Guide for how to write a custom scorer.
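For instance, a minimal sketch using make_scorer (my_score here is a hypothetical metric standing in for your own definition):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def my_score(y_true, y_pred):
    # Hypothetical custom metric: replace with your own definition.
    return (y_true == y_pred).mean()

search = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [50, 100]},
    scoring=make_scorer(my_score),  # greater_is_better=True by default
)
```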
You can use a fixed train/validation split to evaluate the hyperparameters (see the cv parameter), but this will be less robust than a k-fold cross-validation. The test set should be reserved for scoring only the final model; if you use it to select hyperparameters, then the scores you receive will not be unbiased estimates of future performance.
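A sketch of such a fixed split via PredefinedSplit, assuming (illustratively) that the last 200 of 1000 training samples serve as the single validation fold:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 marks samples that always stay in the training fold;
# 0 marks the single validation fold.
test_fold = np.concatenate([np.full(800, -1), np.zeros(200, dtype=int)])
cv = PredefinedSplit(test_fold)

# Then pass it to the search:
# search = GridSearchCV(estimator, param_grid, cv=cv, ...)
```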
There is no easy way to retrieve all the models built by GridSearchCV. (It would generally be a lot of models, and saving them all would generally be a waste of memory.)
The parallelization and parameter-grid parts of GridSearchCV are surprisingly simple; if you need to, you can copy out the relevant parts of the source code to produce your own approach.
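If you go that route, here is a rough sketch, assuming estimator, param_grid, the train/test arrays, and the my_score function above are already defined; it reuses ParameterGrid for the grid handling and joblib for the parallelism that GridSearchCV would otherwise provide:

```python
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.model_selection import ParameterGrid

def fit_and_score(params):
    # Fresh copy of the estimator with this parameter combination.
    model = clone(estimator).set_params(**params)
    model.fit(X_train, y_train)
    # Score on the held-out set with the custom metric.
    return params, my_score(y_test, model.predict(X_test)), model

results = Parallel(n_jobs=-1)(
    delayed(fit_and_score)(params) for params in ParameterGrid(param_grid)
)
# results is a list of (params, score, fitted_model) triples,
# so every individual model is retained.
```

Keep in mind the caveat above: scoring every candidate on the test set means the test scores are no longer unbiased estimates of future performance.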
"Training set tends to give almost perfect accuracy on my Grid Search."
That's a bit surprising, since the CV part of the searches means the models are being scored on unseen data. If you get a very high best_score_ but low performance on the test set, then I would suspect your training set is not actually a representative sample, and that will require a much more nuanced understanding of the situation.
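As a quick diagnostic, assuming search is a fitted GridSearchCV and X_test/y_test are held out, you can compare the two directly:

```python
print("CV best score: ", search.best_score_)           # cross-validated
print("Test-set score:", search.score(X_test, y_test))  # held-out data
```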
Source: https://stackoverflow.com/questions/62864193/get-individual-models-and-customized-score-in-gridsearchcv-and-randomizedsearchc