Stratified Train/Validation/Test-split in scikit-learn

深忆病人 2021-02-15 14:51

There is already a description here of how to do stratified train/test split in scikit via train_test_split (Stratified Train/Test-split in scikit-learn) and a description of ho

2 Answers
  • 2021-02-15 15:10

    The solution is to just use StratifiedShuffleSplit twice, like below:

    from sklearn.model_selection import StratifiedShuffleSplit

    # df is assumed to be a pandas DataFrame with a label column named "target".
    # First split: 60% train, 40% held out for test + validation.
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
    train_index, test_valid_index = next(split.split(df, df.target))
    train_set = df.iloc[train_index]
    test_valid_set = df.iloc[test_valid_index]

    # Second split: divide the held-out 40% evenly into test and validation.
    split2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
    test_index, valid_index = next(split2.split(test_valid_set, test_valid_set.target))
    test_set = test_valid_set.iloc[test_index]
    valid_set = test_valid_set.iloc[valid_index]
    
  • 2021-02-15 15:24

    Yes, this is exactly how I would do it: run train_test_split() twice. Think of the first call as splitting your training set off from the test set; that training set may then be divided into different folds or holdouts down the line.
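
    A minimal sketch of the two-call approach, assuming a 60/20/20 split and toy NumPy data invented for illustration (the arrays and variable names are not from the question). Passing stratify= to each call keeps the class ratio roughly equal across the three sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with an 80/20 class imbalance.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# First call: split the test set (20%) off from the training data.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Second call: carve a validation set out of the remaining training data.
# 0.25 of the remaining 80% equals 20% of the original data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, test_size=0.25,
    stratify=y_train_full, random_state=42
)

# Report the size and positive-class share of each set.
for name, labels in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(labels), labels.mean())
```

    With a DataFrame, the same pattern should work by passing the frame itself and stratify=df["target"], re-deriving the label column from the intermediate frame for the second call.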

    In fact, if you end up testing your model with a scikit-learn estimator that includes built-in cross-validation, you may not even have to run train_test_split() explicitly a second time. The same applies if you use the (very handy!) model_selection.cross_val_score function.
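
    As a rough illustration of that idea, the sketch below holds out a stratified test set once and lets cross_val_score handle the validation folds. The iris dataset, the LogisticRegression model, and the 5-fold setting are assumptions chosen only to make the example runnable:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Example dataset (an assumption; any labelled X, y would do).
X, y = load_iris(return_X_y=True)

# Single explicit split: hold out a stratified test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified folds over the training data stand in for a fixed validation set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print("fold accuracies:", scores)

# Final check on the untouched test set after refitting on all training data.
test_accuracy = model.fit(X_train, y_train).score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

    For a classifier, passing an integer such as cv=5 also gives stratified folds by default; the explicit StratifiedKFold here only makes the shuffling and seed visible.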
