I'm a relatively new user to sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas dataframe that I would like to
If you want train_test_split to behave as you expected (stratify by multiple columns with no duplicates), create a new column that concatenates the values of the other columns, then stratify on that new column.
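from sklearn.model_selection import train_test_split
# Build a single key column from b and c, then stratify on it
df['bc'] = df['b'].astype(str) + df['c'].astype(str)
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['bc']])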
If you're worried about collisions, where pairs such as 11 and 3 versus 1 and 13 both produce the concatenated value 113, you can add some arbitrary string in the middle:
df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)
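To see the difference the separator makes, here is a minimal, self-contained sketch. The toy DataFrame and its values are made up for illustration (only the column names b and c and the bc key come from the answer above), so treat it as a demonstration under those assumptions rather than a drop-in solution.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 'b' and 'c' are the columns to stratify on.
# Each of the four (b, c) combinations occurs exactly 10 times.
df = pd.DataFrame({
    "a": range(40),
    "b": [1, 11] * 20,
    "c": [13] * 20 + [3] * 20,
})

# Naive concatenation: (1, 13) and (11, 3) both become "113".
naive_key = df["b"].astype(str) + df["c"].astype(str)

# With a separator, every (b, c) pair maps to its own key.
df["bc"] = df["b"].astype(str) + "_" + df["c"].astype(str)

train, test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["bc"]
)

# Compare the number of distinct keys with and without the separator,
# then inspect how the combined key is distributed in the test set.
print(naive_key.nunique(), df["bc"].nunique())
print(test["bc"].value_counts().sort_index())
```

Note that stratification needs at least two rows per distinct key, so a combined key that produces very rare combinations will make train_test_split raise an error.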