sklearn train_test_split on pandas stratify by multiple columns

借酒劲吻你 2021-01-31 16:48

I'm a relatively new user of sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas dataframe that I would like to

3 Answers

长情又很酷 (OP)

2021-01-31 17:47
If you want train_test_split to behave as you expected (stratify by multiple columns with no duplicates), create a new column that concatenates the values of the other columns, then stratify on that new column.
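```
from sklearn.model_selection import train_test_split

# Build one label per (b, c) combination and stratify on it
df['bc'] = df['b'].astype(str) + df['c'].astype(str)
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['bc']])
```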
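To check that the split really is stratified on the combination, here is a minimal sketch on made-up data. The frame, its column `a`, and all values are assumptions for illustration, not taken from the question:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 40 rows, four (b, c) combinations with 10 rows each.
df = pd.DataFrame({
    "a": range(40),
    "b": [1, 1, 2, 2] * 10,
    "c": [0, 1, 0, 1] * 10,
})

# Composite key: one label per (b, c) combination.
df["bc"] = df["b"].astype(str) + df["c"].astype(str)

train, test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["bc"]
)

# Count rows per combination in each split.
print(train["bc"].value_counts().sort_index())
print(test["bc"].value_counts().sort_index())
```

With this data every combination should appear 8 times in train and 2 times in test, and no row ends up in both. Note that stratifying needs at least 2 rows per combination; with a combination that occurs only once, train_test_split raises a ValueError.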
If you're worried about collisions, where for example b=11, c=3 and b=1, c=13 both produce the concatenated value 113, you can add an arbitrary separator string in the middle:
```
df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)
```
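A small sketch of the collision and the fix, again on made-up values. The last line uses `groupby(...).ngroup()` as an alternative that is not part of the answer above; it assigns an integer id per combination and so sidesteps string concatenation entirely:

```python
import pandas as pd

# Made-up rows: (b=11, c=3) and (b=1, c=13) are different combinations.
df = pd.DataFrame({"b": [11, 1], "c": [3, 13]})

plain = df["b"].astype(str) + df["c"].astype(str)
separated = df["b"].astype(str) + "_" + df["c"].astype(str)

print(plain.tolist())      # ['113', '113']
print(separated.tolist())  # ['11_3', '1_13']

# Alternative (an assumption, not the answer's method): integer id per (b, c) pair.
group_id = df.groupby(["b", "c"]).ngroup()
```

The separator only helps if it cannot occur inside the values themselves; for integer columns "_" is safe, for free-text columns the group id is probably the more robust choice.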