问题
I want to score different classifiers with different parameters.
For speedup on LogisticRegression
I use LogisticRegressionCV
(which at least 2x faster) and plan use GridSearchCV
for others.
But problem while it give me equal C
parameters, but not the AUC ROC
scoring.
I'll try fix many parameters like scorer
, random_state
, solver
, max_iter
, tol
...
Please look at example (real data have no mater):
Test data and common part:
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
y[y <= y.mean()] = 0; y[y > 0] = 1
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
fold = KFold(len(y), n_folds=5, shuffle=True, random_state=777)
GridSearchCV
grid = {
'C': np.power(10.0, np.arange(-10, 10))
, 'solver': ['newton-cg']
}
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
gs.fit(X, y)
print ('gs.best_score_:', gs.best_score_)
gs.best_score_: 0.939162082194
LogisticRegressionCV
searchCV = LogisticRegressionCV(
Cs=list(np.power(10.0, np.arange(-10, 10)))
,penalty='l2'
,scoring='roc_auc'
,cv=fold
,random_state=777
,max_iter=10000
,fit_intercept=True
,solver='newton-cg'
,tol=10
)
searchCV.fit(X, y)
print ('Max auc_roc:', searchCV.scores_[1].max())
Max auc_roc: 0.970588235294
Solver newton-cg
used just to provide fixed value, other tried too.
What I forgot?
P.S. In both cases I also got warning "/usr/lib64/python3.4/site-packages/sklearn/utils/optimize.py:193: UserWarning: Line Search failed warnings.warn('Line Search failed')" which I can't understand too. I'll be happy if someone also describe what it mean, but I hope it is not relevant to my main question.
EDIT UPDATES
By @joeln comment add max_iter=10000
and tol=10
parameters too. It does not change result in any digit, but the warning disappeared.
回答1:
Here is a copy of the answer by Tom on the scikit-learn issue tracker:
LogisticRegressionCV.scores_
gives the score for all the folds.
GridSearchCV.best_score_
gives the best mean score over all the folds.
To get the same result, you need to change your code:
print('Max auc_roc:', searchCV.scores_[1].max()) # is wrong
print('Max auc_roc:', searchCV.scores_[1].mean(axis=0).max()) # is correct
By also using the default tol=1e-4
instead of your tol=10
, I get:
('gs.best_score_:', 0.939162082193857)
('Max auc_roc:', 0.93915947999923843)
The (small) remaining difference might come from warm starting in LogisticRegressionCV
(which is actually what makes it faster than GridSearchCV
).
来源:https://stackoverflow.com/questions/36271166/how-to-get-comparable-and-reproducible-results-from-logisticregressioncv-and-gri