Alternate different models in Pipeline for GridSearchCV

隐瞒了意图╮ 2021-02-09 03:11

I want to build a Pipeline in sklearn and test different models using GridSearchCV.

Just an example (please do not pay attention to which particular models are chosen):

2 Answers
  • 2021-02-09 03:18

    An alternative solution that does not require prefixing the estimator names in the parameter grid is the following:

    import numpy as np  # needed for np.arange in the KNN grid below
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    
    # the models that you want to compare
    models = {
        'RandomForestClassifier': RandomForestClassifier(),
        'KNeighboursClassifier': KNeighborsClassifier(),
        'LogisticRegression': LogisticRegression()
    }
    
    # the optimisation parameters for each of the above models
    params = {
        'RandomForestClassifier':{ 
                "n_estimators"      : [100, 200, 500, 1000],
                "max_features"      : ["sqrt", "log2"],  # "auto" (same as "sqrt" for classifiers) was removed in scikit-learn 1.3
                "bootstrap": [True],
                "criterion": ['gini', 'entropy'],
                "oob_score": [True, False]
                },
        'KNeighboursClassifier': {
            'n_neighbors': np.arange(3, 15),
            'weights': ['uniform', 'distance'],
            'algorithm': ['ball_tree', 'kd_tree', 'brute']
            },
        'LogisticRegression': {
            'solver': ['newton-cg', 'sag', 'lbfgs'],
            'multi_class': ['ovr', 'multinomial']
            }  
    }
    

    and you can define:

    from sklearn.model_selection import GridSearchCV
    
    def fit(train_features, train_actuals):
        """
        Fit each model to the training data with GridSearchCV and report
        the best parameters and cross-validated score found for it.
        """
        for name, est in models.items():
            # each model looks up its own parameter grid by name
            gscv = GridSearchCV(estimator=est, param_grid=params[name], cv=5)
            gscv.fit(train_features, train_actuals)
            print("{}: best parameters are {} (score {:.3f})".format(
                name, gscv.best_params_, gscv.best_score_))
    

    This loops over the models, with each model looking up its own parameter grid in the params dictionary. If models and params are not global variables, remember to pass them to the fit function. Have a look at this GitHub project for a more complete overview.
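
    As a minimal sketch of the advice above, the dictionaries can be passed in as arguments and the fitted searches returned, so the best model can be picked afterwards. The function name fit_all, the iris data and the small grids are illustrative assumptions, not part of the original answer:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def fit_all(models, params, train_features, train_actuals, cv=5):
    """Run one GridSearchCV per model and return the fitted searches by name."""
    searches = {}
    for name, est in models.items():
        gscv = GridSearchCV(estimator=est, param_grid=params[name], cv=cv)
        gscv.fit(train_features, train_actuals)
        searches[name] = gscv
    return searches


# Small illustrative grids on the iris data so the sketch runs quickly.
X, y = load_iris(return_X_y=True)
models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
params = {
    "KNeighborsClassifier": {"n_neighbors": [3, 5, 7]},
    "LogisticRegression": {"C": [0.1, 1.0, 10.0]},
}

searches = fit_all(models, params, X, y)

# Pick the model whose best cross-validated score is highest.
best_name = max(searches, key=lambda name: searches[name].best_score_)
print(best_name, searches[best_name].best_params_, searches[best_name].best_score_)
```

    Returning the GridSearchCV objects instead of only printing keeps best_estimator_, best_params_ and cv_results_ available for each model.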

  • 2021-02-09 03:38

    Let's assume you want to use PCA and TruncatedSVD as your dimensionality reduction step.

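    from sklearn import decomposition
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    pca = decomposition.PCA()
    svd = decomposition.TruncatedSVD()
    svm = SVC()
    n_components = [20, 40, 64]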
    

    You can do this:

    pipe = Pipeline(steps=[('reduction', pca), ('svm', svm)])
    
    # params_grid is a list of dicts rather than a single dict:
    # the first dict holds the PCA-related parameters, the second the TruncatedSVD ones.
    # Every value must be a list, so the estimators are wrapped as [pca] and [svd].
    
    params_grid = [{
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'reduction': [pca],
        'reduction__n_components': n_components,
    },
    {
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'reduction': [svd],
        'reduction__n_components': n_components,
        'reduction__algorithm': ['randomized']
    }]
    

    Now just pass the pipeline object to GridSearchCV:

    from sklearn.model_selection import GridSearchCV

    grd = GridSearchCV(pipe, param_grid=params_grid)
    

    Calling grd.fit() searches over both elements of the params_grid list: each dictionary is expanded into its own grid, and the candidates from one dictionary are never combined with those from the other.

    Please look at my other answer for more details: "Parallel" pipeline to get best model using gridsearch
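
    For reference, a compact end-to-end sketch of the list-of-dicts approach described in this answer. The digits dataset, the 500-sample subset, cv=3 and the reduced grids are assumptions chosen only to keep the run short:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Illustrative data: a subset of the 64-feature digits dataset.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# The 'reduction' step is a placeholder that the grid swaps out.
pipe = Pipeline(steps=[("reduction", PCA()), ("svm", SVC())])

# One dict per candidate reduction step; every value is a list.
param_grid = [
    {
        "reduction": [PCA()],
        "reduction__n_components": [10, 20],
        "svm__C": [1, 10],
    },
    {
        "reduction": [TruncatedSVD()],
        "reduction__n_components": [10, 20],
        "svm__C": [1, 10],
    },
]

grd = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grd.fit(X, y)

# Each dict contributes 2 * 2 candidates, so 8 are evaluated in total.
n_candidates = len(grd.cv_results_["params"])
print(n_candidates)
print(type(grd.best_params_["reduction"]).__name__, grd.best_score_)
```

    The winning reduction step is available as grd.best_params_["reduction"], and grd.best_estimator_ is the full pipeline refitted with it.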
