How to generate a train-test-split based on a group id?

淺唱寂寞╮ 提交于 2020-08-21 06:55:11

问题


I have the following data:

pd.DataFrame({'Group_ID':[1,1,1,2,2,2,3,4,5,5],
          'Item_id':[1,2,3,4,5,6,7,8,9,10],
          'Target': [0,0,1,0,1,1,0,0,0,1]})

   Group_ID Item_id Target
0         1       1      0
1         1       2      0
2         1       3      1
3         2       4      0
4         2       5      1
5         2       6      1
6         3       7      0
7         4       8      0
8         5       9      0
9         5      10      1

I need to split the dataset into a training and testing set based on the "Group_ID" so that 80% of the data goes into a training set and 20% into a test set.

That is, I need my training set to look something like:

    Group_ID Item_id Target
0          1       1      0
1          1       2      0
2          1       3      1
3          2       4      0
4          2       5      1
5          2       6      1
6          3       7      0
7          4       8      0

And test set:

Test Set
   Group_ID Item_id Target
8         5       9      0
9         5      10      1

What would be the simplest way to do this? As far as I know, the standard test_train_split function in sklearn does not support splitting by groups in a way where I can also indicate the size of the split (e.g. 80/20).


回答1:


I figured out the answer. This seems to work:

train_inds, test_inds = next(GroupShuffleSplit(test_size=.20, n_splits=2, random_state = 7).split(df, groups=df['Group_Id']))

train = df.iloc[train_inds]
test = df.iloc[test_inds]


来源:https://stackoverflow.com/questions/54797508/how-to-generate-a-train-test-split-based-on-a-group-id

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!