There is already a description here of how to do stratified train/test split in scikit via train_test_split (Stratified Train/Test-split in scikit-learn) and a description of ho
The solution is to just use StratifiedShuffleSplit twice, like below:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
for train_index, test_valid_index in split.split(df, df.target):
train_set = df.iloc[train_index]
test_valid_set = df.iloc[test_valid_index]
split2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for test_index, valid_index in split2.split(test_valid_set, test_valid_set.target):
test_set = test_valid_set.iloc[test_index]
valid_set = test_valid_set.iloc[valid_index]
Yes, this is exactly how I would do it - running train_test_split()
twice. Think of the first as splitting off your training set, and then that training set may get divided into different folds or holdouts down the line.
In fact, if you end up testing your model using a scikit model that includes built-in cross-validation, you may not even have to explicitly run train_test_split()
again. Same if you use the (very handy!) model_selection.cross_val_score
function.