Question
Could anyone check what is wrong with the following code? Did I make a mistake in any step of building my model? I already added the 'clf__' prefix (with two underscores) to the parameter names.
clf = RandomForestClassifier()
pca = PCA()
pca_clf = make_pipeline(pca, clf)
kfold = KFold(n_splits=10, random_state=22)
parameters = {'clf__n_estimators': [4, 6, 9],
              'clf__max_features': ['log2', 'sqrt', 'auto'],
              'clf__criterion': ['entropy', 'gini'],
              'clf__max_depth': [2, 3, 5, 10],
              'clf__min_samples_split': [2, 3, 5],
              'clf__min_samples_leaf': [1, 5, 8]}
grid_RF = GridSearchCV(pca_clf, param_grid=parameters,
                       scoring='accuracy', cv=kfold)
grid_RF = grid_RF.fit(X_train, y_train)
clf = grid_RF.best_estimator_
clf.fit(X_train, y_train)
grid_RF.best_score_
cv_result = cross_val_score(clf, X_train, y_train, cv=kfold, scoring="accuracy")
cv_result.mean()
Answer 1:
You are making a wrong assumption about how make_pipeline names its steps. From the documentation:
This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
This means that when you pass a PCA object, its step name is set to 'pca' (lowercase), and when you pass a RandomForestClassifier object, its step is named 'randomforestclassifier', not 'clf' as you expected.
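To check the generated names yourself, you can inspect the pipeline directly. This is a minimal sketch; the variable names pipe, step_names and valid_keys are only illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(PCA(), RandomForestClassifier())

# Step names are derived from the lowercased class names.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['pca', 'randomforestclassifier']

# Every key accepted in a parameter grid appears in get_params(),
# in the form "<step name>__<parameter name>".
valid_keys = sorted(pipe.get_params().keys())
print([k for k in valid_keys if k.endswith('__n_estimators')])
```

Any key you put into param_grid has to be one of the names returned by get_params().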
So the parameter grid you built is invalid: its keys use the prefix clf__, but the pipeline has no step named 'clf'.
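The mismatch can be reproduced without running a grid search, because GridSearchCV applies each candidate through set_params. A small sketch follows (error_message is just an illustrative variable, and the exact wording of the message differs between scikit-learn versions):

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(PCA(), RandomForestClassifier())

error_message = None
try:
    # There is no step called 'clf', so this is rejected.
    pipe.set_params(clf__n_estimators=4)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# The auto-generated step name is accepted.
pipe.set_params(randomforestclassifier__n_estimators=4)
```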
Solution 1:
Replace this line:
pca_clf = make_pipeline(pca, clf)
with:
from sklearn.pipeline import Pipeline
pca_clf = Pipeline([('pca', pca), ('clf', clf)])
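Here is a rough end-to-end sketch of this approach with explicitly named steps. It rests on several assumptions that are not in your code: the data is synthetic (make_classification stands in for your X_train and y_train), the grid is cut down so it runs quickly, and shuffle=True is added to KFold because recent scikit-learn versions reject random_state without shuffling.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real training data (assumption).
X_train, y_train = make_classification(n_samples=200, n_features=10,
                                       random_state=0)

# Explicit step names, so the 'clf__' prefix now matches a real step.
pca_clf = Pipeline([('pca', PCA()),
                    ('clf', RandomForestClassifier(random_state=0))])

# shuffle=True is needed when random_state is set in recent versions.
kfold = KFold(n_splits=5, shuffle=True, random_state=22)

# Reduced grid, only to keep the sketch fast.
parameters = {'clf__n_estimators': [4, 9],
              'clf__max_depth': [2, 5]}

grid_RF = GridSearchCV(pca_clf, param_grid=parameters,
                       scoring='accuracy', cv=kfold)
grid_RF.fit(X_train, y_train)

print(grid_RF.best_params_)
print(grid_RF.best_score_)
```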
Solution 2:
If you don't want to change the pca_clf = make_pipeline(pca, clf) line, then replace every occurrence of clf in the keys of your parameters dict with 'randomforestclassifier', like this:
parameters = {'randomforestclassifier__n_estimators': [4, 6, 9],
              'randomforestclassifier__max_features': ['log2', 'sqrt', 'auto'],
              'randomforestclassifier__criterion': ['entropy', 'gini'],
              'randomforestclassifier__max_depth': [2, 3, 5, 10],
              'randomforestclassifier__min_samples_split': [2, 3, 5],
              'randomforestclassifier__min_samples_leaf': [1, 5, 8]}
Side note: you don't need these lines in your code:
clf = grid_RF.best_estimator_
clf.fit(X_train, y_train)
With the default refit=True, GridSearchCV has already refitted best_estimator_ on the whole training data using the best parameters found, so calling clf.fit() again is redundant.
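A quick way to convince yourself of this is the sketch below. It uses synthetic data and a bare RandomForestClassifier instead of your pipeline (both are assumptions, chosen to keep it short), and relies on check_is_fitted from sklearn.utils.validation, which raises NotFittedError for an unfitted estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.utils.validation import check_is_fitted

# Synthetic stand-in for the real training data (assumption).
X_train, y_train = make_classification(n_samples=120, n_features=8,
                                       random_state=0)

# refit is left at its default value, True.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={'n_estimators': [4, 9]},
                    scoring='accuracy', cv=3)
grid.fit(X_train, y_train)

best = grid.best_estimator_
check_is_fitted(best)  # would raise NotFittedError if best were unfitted

# No extra best.fit(...) call is needed before predicting.
predictions = best.predict(X_train)
print(predictions.shape)
```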
Source: https://stackoverflow.com/questions/48271342/invalid-parameter-clf-for-estimator-pipeline-in-sklearn