Using explicit (predefined) validation set for grid search with sklearn

隐瞒了意图╮ 2020-12-07 17:38

I have a dataset, which has previously been split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance across dif

3 Answers
  • 2020-12-07 18:13
    # Import libraries
    from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
    
    # Split the data into training and validation sets
    # (X is assumed to be a pandas DataFrame with a unique index, y a pandas Series)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, train_size=0.8, stratify=y, random_state=2020)
    
    # Create a list where training rows are marked -1 and validation rows are marked 0
    split_index = [-1 if x in X_train.index else 0 for x in X.index]
    
    # Use the list to create the PredefinedSplit
    pds = PredefinedSplit(test_fold=split_index)
    
    # Use the PredefinedSplit in GridSearchCV
    # (estimator and param_grid are assumed to be defined already)
    clf = GridSearchCV(estimator=estimator,
                       cv=pds,
                       param_grid=param_grid)
    
    # Fit with all data; each candidate is trained on the -1 rows and scored on the 0 rows
    clf.fit(X, y)
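To check that a list of -1 and 0 labels really produces a single train/validation split, the splitter can be inspected directly. The following is a minimal sketch; the six-element `test_fold` array is a made-up example, not data from the question.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical fold labels: the first four samples are train-only (-1),
# the last two form the single validation fold (0).
test_fold = np.array([-1, -1, -1, -1, 0, 0])
pds = PredefinedSplit(test_fold=test_fold)

# Only one distinct non-negative label is present, so there is exactly one split.
n_splits = pds.get_n_splits()

# Collect the (train indices, validation indices) pairs the splitter yields.
splits = [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in pds.split()]

print(n_splits)
print(splits)
```

If the number of splits is greater than one, the label list contains more than one distinct non-negative value, and the search would score on several folds rather than on one fixed validation set.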
    
  • 2020-12-07 18:17

    Use PredefinedSplit

    ps = PredefinedSplit(test_fold=your_test_fold)
    

    Then set cv=ps in GridSearchCV.

    test_fold : array-like, shape (n_samples,)

    test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.

    The scikit-learn user guide adds that, when using a validation set, you should set test_fold to 0 for all samples that are part of the validation set and to -1 for all other samples.
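When the training and validation sets already exist as separate arrays, as in the question, one way to apply this is to stack them and build `test_fold` from their lengths. The sketch below makes several assumptions: the data is synthetic (from `make_classification`), the estimator is `SVC`, and the `C` grid is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

# Synthetic stand-in for data that arrives already split (an assumption for this sketch).
X_all, y_all = make_classification(n_samples=120, n_features=5, random_state=0)
X_train, y_train = X_all[:90], y_all[:90]
X_val, y_val = X_all[90:], y_all[90:]

# Stack train and validation rows, training rows first.
X = np.concatenate([X_train, X_val])
y = np.concatenate([y_train, y_val])

# -1 for every training row (never validated on), 0 for every validation row.
test_fold = np.concatenate([
    np.full(len(X_train), -1),
    np.zeros(len(X_val), dtype=int),
])

ps = PredefinedSplit(test_fold=test_fold)

# refit=False: only run the search, do not retrain on train + validation afterwards.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=ps, refit=False)
search.fit(X, y)

print(search.best_params_)
```

With the default `refit=True`, GridSearchCV retrains the best candidate on every row passed to `fit`, that is, on training and validation data together. Setting `refit=False` and then fitting a fresh model on `X_train` with `search.best_params_` keeps the validation rows out of the final model.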

  • 2020-12-07 18:25

    Consider using the hypopt Python package (pip install hypopt), of which I am an author. It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out of the box and can be used with TensorFlow, PyTorch, Caffe2, etc. as well.

    # Code from https://github.com/cgnorthcutt/hypopt
    # Assuming you already have train, test, val sets and a model.
    from hypopt import GridSearch
    from sklearn.svm import SVR  # the model used in this example
    param_grid = [
        {'C': [1, 10, 100], 'kernel': ['linear']},
        {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
    ]
    # Grid-search all parameter combinations using a validation set.
    opt = GridSearch(model=SVR(), param_grid=param_grid)
    opt.fit(X_train, y_train, X_val, y_val)
    print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
    

    EDIT: I (think I) received -1's on this response because I'm suggesting a package that I authored. This is unfortunate, given that the package was created specifically to solve this type of problem.
