I want to build a Pipeline in sklearn and test different models using GridSearchCV.
Just an example (please don't pay attention to which particular models are chosen):
An alternative solution that does not require prefixing the estimator names in the parameter grid is the following:
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# the models that you want to compare
models = {
    'RandomForestClassifier': RandomForestClassifier(),
    'KNeighboursClassifier': KNeighborsClassifier(),
    'LogisticRegression': LogisticRegression()
}
# the optimisation parameters for each of the above models
params = {
    'RandomForestClassifier': {
        "n_estimators": [100, 200, 500, 1000],
        "max_features": ["auto", "sqrt", "log2"],
        "bootstrap": [True],
        "criterion": ['gini', 'entropy'],
        "oob_score": [True, False]
    },
    'KNeighboursClassifier': {
        'n_neighbors': np.arange(3, 15),
        'weights': ['uniform', 'distance'],
        'algorithm': ['ball_tree', 'kd_tree', 'brute']
    },
    'LogisticRegression': {
        'solver': ['newton-cg', 'sag', 'lbfgs'],
        'multi_class': ['ovr', 'multinomial']
    }
}
and you can define:
from sklearn.model_selection import GridSearchCV

def fit(train_features, train_actuals):
    """
    Fits each of the models to the training data, running a GridSearchCV
    cross-validated parameter search for each one and reporting the best
    parameters found.
    """
    for name in models.keys():
        est = models[name]
        est_params = params[name]
        gscv = GridSearchCV(estimator=est, param_grid=est_params, cv=5)
        gscv.fit(train_features, train_actuals)
        print("best parameters are: {}".format(gscv.best_params_))
This basically loops through the different models, each one looking up its own set of optimisation parameters in the params dictionary. Of course, do not forget to pass the models and params dictionaries to the fit function in case you do not have them as global variables. Have a look at this GitHub project for a more complete overview.
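For illustration, here is a hedged sketch of that variant which takes the dictionaries as arguments and returns the fitted searches (the fit_all name and the make_classification toy data are assumptions for this sketch, not part of the original answer; with the full grids above it can take a while, and on recent scikit-learn versions legacy values such as max_features='auto' may need removing):
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

def fit_all(models, params, train_features, train_actuals):
    """Grid-search every model with its own parameter grid and return
    the fitted GridSearchCV objects keyed by model name."""
    searches = {}
    for name, est in models.items():
        gscv = GridSearchCV(estimator=est, param_grid=params[name], cv=5)
        gscv.fit(train_features, train_actuals)
        print("{}: best parameters are {}".format(name, gscv.best_params_))
        searches[name] = gscv
    return searches

# toy data just to make the call concrete
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
results = fit_all(models, params, X, y)
best_knn = results['KNeighboursClassifier'].best_estimator_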
Let's assume you want to use PCA and TruncatedSVD as your dimensionality reduction step.
from sklearn import decomposition
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pca = decomposition.PCA()
svd = decomposition.TruncatedSVD()
svm = SVC()
n_components = [20, 40, 64]
You can do this:
pipe = Pipeline(steps=[('reduction', pca), ('svm', svm)])
# Change params_grid -> instead of a dict, make it a list of dicts
# In the first element, pass the parameters related to pca, and in the second those related to svd
params_grid = [
    {
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'reduction': [pca],
        'reduction__n_components': n_components,
    },
    {
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'reduction': [svd],
        'reduction__n_components': n_components,
        'reduction__algorithm': ['randomized']
    }
]
and now just pass the pipeline object to GridSearchCV:
grd = GridSearchCV(pipe, param_grid=params_grid)
Calling grd.fit() will search over both elements of the params_grid list, using all parameter combinations from one dict at a time (so the PCA and TruncatedSVD settings are never mixed within a single candidate).
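A minimal, self-contained sketch of this second approach, using the digits dataset and deliberately small grids so it runs quickly (both are assumptions for illustration, not part of the original answer):
from sklearn import datasets, decomposition
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)

pipe = Pipeline(steps=[('reduction', decomposition.PCA()), ('svm', SVC())])
params_grid = [
    {'reduction': [decomposition.PCA()],
     'reduction__n_components': [20, 40],
     'svm__C': [1, 10]},
    {'reduction': [decomposition.TruncatedSVD()],
     'reduction__n_components': [20, 40],
     'svm__C': [1, 10]},
]

grd = GridSearchCV(pipe, param_grid=params_grid, cv=3)
grd.fit(X, y)

# best_params_ shows which reducer won (PCA or TruncatedSVD),
# its n_components, and the SVC settings
print(grd.best_params_)
print(grd.best_score_)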
Please look at my other answer for more details: "Parallel" pipeline to get best model using gridsearch