Stratified Train/Validation/Test-split in scikit-learn

前端未结

关注

 2  2160

There is already a description here of how to do stratified train/test split in scikit via train_test_split (Stratified Train/Test-split in scikit-learn) and a description of ho

相关标签:

2条回答

北恋

2021-02-15 15:10

The solution is to just use StratifiedShuffleSplit twice, like below:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
for train_index, test_valid_index in split.split(df, df.target):
    train_set = df.iloc[train_index]
    test_valid_set = df.iloc[test_valid_index]

split2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for test_index, valid_index in split2.split(test_valid_set, test_valid_set.target):
    test_set = test_valid_set.iloc[test_index]
    valid_set = test_valid_set.iloc[valid_index]

0 讨论(0)

被撕碎了的回忆

2021-02-15 15:24

Yes, this is exactly how I would do it - running train_test_split() twice. Think of the first as splitting off your training set, and then that training set may get divided into different folds or holdouts down the line.

In fact, if you end up testing your model using a scikit model that includes built-in cross-validation, you may not even have to explicitly run train_test_split() again. Same if you use the (very handy!) model_selection.cross_val_score function.

0 讨论(0)
发布评论:

提交评论
- 加载中...