Question
I am running a Python 3 classification script on a server using the following code:
# define knn classifier for transformed data
knn_classifier = neighbors.KNeighborsClassifier()
# define KNN parameters
knn_parameters = [{
    'n_neighbors': [1, 3, 5, 7, 9, 11],
    'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'n_jobs': [-1],
    'weights': ['uniform', 'distance']}]
# Stratified k-fold (default for classifier)
# n = 5 folds is default
knn_models = GridSearchCV(estimator = knn_classifier, param_grid = knn_parameters, scoring = 'accuracy')
# fit grid search models to transformed training data
knn_models.fit(X_train_transformed, y_train)
I then save the GridSearchCV object using pickle:
# save model
with open('knn_models.pickle', 'wb') as f:
    pickle.dump(knn_models, f)
So I can test the classifiers on smaller datasets on my local machine by running:
knn_models = pickle.load(open("knn_models.pickle", "rb"))
validation_knn_model = knn_models.best_estimator_
Which is great if I only want to evaluate the best estimator on a validation set. But what I'd actually like to do is:
- pull the original data out of the GridSearchCV object (I'm assuming it's stored somewhere in the object because to classify the new validation set, this is required)
- try a few specific classifiers with almost all of the best parameters as determined by the grid search but changing a specific input parameter, i.e. k = 3, 5, 7
- retrieve y_pred, i.e. the predictions for each validation set for all of the new classifiers that I tested above
Answer 1:
As discussed in the comments, GridSearchCV does not include the original data (and it would be arguably absurd if it did). The only data it includes is its own bookkeeping, i.e. the detailed scores and parameters tried for each CV fold. The best_estimator_ returned is the only thing needed to apply the model to any new data encountered, but if, as you say, you would like to dig deeper into the details, the full results are returned in its cv_results_ attribute.
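As a rough illustration of the point above, the following sketch fits a deliberately tiny grid on the iris data (a stand-in for your own training set, and not the grid from the question), round-trips the search object through pickle in memory, and then looks at what is available afterwards. The variable names are made up for the example.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data and a deliberately small grid (assumptions for this sketch).
X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 9]},
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

# Round-trip through pickle in memory; this stands in for the file on disk.
loaded = pickle.loads(pickle.dumps(search))

# Bookkeeping that survives the round trip.
print(loaded.best_params_)
print(loaded.best_score_)
print(sorted(loaded.cv_results_.keys()))

# New samples have to be passed in explicitly; here the first five iris rows
# stand in for a validation set.
y_pred = loaded.predict(X[:5])
print(y_pred)
```

The search object exposes parameters and scores through its public attributes, while any data you want predictions for has to be handed to predict yourself.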
Adapting the example from the documentation to the knn classifier with your own knn_parameters grid (but removing n_jobs, which only affects the fitting speed and is not a real hyperparameter of the algorithm), and keeping cv=3 for simplicity, we have:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
import pandas as pd
iris = load_iris()
knn_parameters = [{
    'n_neighbors': [1, 3, 5, 7, 9, 11],
    'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'weights': ['uniform', 'distance']}]
knn_classifier = KNeighborsClassifier()
clf = GridSearchCV(estimator = knn_classifier, param_grid = knn_parameters, scoring = 'accuracy', n_jobs=-1, cv=3)
clf.fit(iris.data, iris.target)
clf.best_estimator_
# result:
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
So, as said, this last result tells you all you need to know to apply the algorithm to any new data (validation, test, from deployment etc). Also, you may find that removing the n_jobs entry from the knn_parameters grid and asking instead for n_jobs=-1 in the GridSearchCV object results in a much faster CV procedure. Nevertheless, if you want to use n_jobs=-1 in your final model, you can easily manipulate the best_estimator_ to do so:
clf.best_estimator_.n_jobs = -1
clf.best_estimator_
# result
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
weights='uniform')
This actually answers your second question, since you can similarly manipulate the best_estimator_ to change other hyperparameters, too.
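A minimal sketch of how this could look for k = 3, 5, 7, under some assumptions that are not part of the question: iris with a train/validation split stands in for your data, the grid is kept tiny, and the names X_val and predictions are invented for the example. Instead of editing best_estimator_ in place, it uses clone and set_params and refits each variant. Refitting is the safer habit in general, since some hyperparameters (e.g. algorithm or leaf_size) only take effect when the model is fitted.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: iris split into a training part and a held-out validation part.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# A deliberately small grid, just to obtain a best_estimator_ to start from.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 9], "weights": ["uniform", "distance"]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

predictions = {}
for k in (3, 5, 7):
    # clone() copies the hyperparameters but not the fitted state,
    # so search.best_estimator_ itself is left untouched.
    model = clone(search.best_estimator_).set_params(n_neighbors=k)
    model.fit(X_train, y_train)  # the training data must be supplied again
    predictions[k] = model.predict(X_val)  # y_pred for this value of k

for k, y_pred in predictions.items():
    print(k, (y_pred == y_val).mean())
```

Each entry of predictions is the y_pred array for one value of k, so the variants can be compared against y_val with whatever metric you prefer.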
So, having found the best model is where most people would stop. But if, for any reason, you want to dig further into the details of the whole grid search process, the detailed results are returned in the cv_results_ attribute, which you can even import into a pandas dataframe for easier inspection:
cv_results = pd.DataFrame.from_dict(clf.cv_results_)
For example, the cv_results dataframe includes a column rank_test_score which, as its name clearly implies, contains the rank of each parameter combination:
cv_results['rank_test_score']
# result:
0 481
1 481
2 145
3 145
4 1
...
571 1
572 145
573 145
574 433
575 1
Name: rank_test_score, Length: 576, dtype: int32
Here 1 means best, and you can readily see that more than one combination is ranked as 1 - so in fact here we have more than one "best" model (i.e. parameter combination)! Although here this is most probably due to the relative simplicity of the iris dataset used, there is no reason in principle why it cannot happen in a real case, too. In such cases, the returned best_estimator_ is just the first of these occurrences - here the combination number 4:
cv_results.iloc[4]
# result:
mean_fit_time 0.000669559
std_fit_time 1.55811e-05
mean_score_time 0.00474652
std_score_time 0.000488042
param_algorithm auto
param_leaf_size 5
param_n_neighbors 5
param_weights uniform
params {'algorithm': 'auto', 'leaf_size': 5, 'n_neigh...
split0_test_score 0.98
split1_test_score 0.98
split2_test_score 0.98
mean_test_score 0.98
std_test_score 0
rank_test_score 1
Name: 4, dtype: object
which, as you can easily see, has the same parameters as our best_estimator_ above. But now you can inspect all the "best" models, simply by:
cv_results.loc[cv_results['rank_test_score']==1]
which, in my case, results in no fewer than 144 models (out of the total 6*12*4*2 = 576 models tried)! So, you can in fact select among more choices, or even use other additional criteria, say the standard deviation of the returned score (the lower the better, although here it is at the minimum value of 0), instead of relying simply on the maximum mean score, which is what the automatic procedure will return.
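One possible way to apply such an extra criterion is sketched below. The tie-break rule (smallest std_test_score first, then shortest mean_fit_time) is only an illustrative choice, and the small grid on iris is again a stand-in rather than the full grid used above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]},
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

cv_results = pd.DataFrame(search.cv_results_)

# All candidates that share the top rank.
tied = cv_results.loc[cv_results["rank_test_score"] == 1]

# Hypothetical tie-break rule: smallest score spread first, then fastest fit.
chosen = tied.sort_values(["std_test_score", "mean_fit_time"]).iloc[0]
chosen_params = chosen["params"]
print(chosen_params)

# Build and fit a fresh classifier from the chosen row of the results table.
model = KNeighborsClassifier(**chosen_params).fit(X, y)
```

The params column already holds each combination as a dict, which is why it can be unpacked straight into the classifier's constructor.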
Hopefully these will be enough to get you started...
Source: https://stackoverflow.com/questions/61687800/retrieving-specific-classifiers-and-data-from-gridsearchcv