sklearn train_test_split on pandas stratify by multiple columns

借酒劲吻你 2021-01-31 16:48

I'm a relatively new user to sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas dataframe that I would like to

3 Answers
  •  长情又很酷
    2021-01-31 17:47

    If you want train_test_split to behave as you expected (stratify by multiple columns with no duplicates), create a new column that is a concatenation of the values in your other columns and stratify on the new column.

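    from sklearn.model_selection import train_test_split
    # Combine b and c into a single key column, then stratify on that key.
    df['bc'] = df['b'].astype(str) + df['c'].astype(str)
    train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df['bc'])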
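
As a rough, self-contained sketch of the approach above: the toy DataFrame below is made up for illustration (only the column names `b`, `c` and `bc` come from the answer), and its values are single digits, so plain concatenation cannot produce ambiguous keys here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 rows, 10 rows for each (b, c) combination.
df = pd.DataFrame({
    "a": range(40),
    "b": [0, 1] * 20,
    "c": [0] * 20 + [1] * 20,
})

# Build one key per (b, c) combination. Plain concatenation is unambiguous
# here only because every value is a single digit.
df["bc"] = df["b"].astype(str) + df["c"].astype(str)

train, test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["bc"]
)

# Count how often each combined key appears on each side of the split.
train_counts = train["bc"].value_counts().sort_index()
test_counts = test["bc"].value_counts().sort_index()
print(train_counts)
print(test_counts)
```

With four equally sized groups and `test_size=0.2`, each group should contribute the same number of rows to the test set, and no row index should appear in both `train` and `test`.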
    

    If you're worried about collisions, where pairs such as (11, 3) and (1, 13) both produce the concatenated value 113, you can add an arbitrary separator string in the middle:

    df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)
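
To make the collision concrete, here is a small sketch with two made-up rows (the values are chosen purely for illustration) showing that plain concatenation merges two different pairs into one key, while the separator keeps them apart.

```python
import pandas as pd

# Hypothetical rows chosen so that two different (b, c) pairs collide.
df = pd.DataFrame({"b": [11, 1], "c": [3, 13]})

# Without a separator, both rows end up with the same key.
plain = df["b"].astype(str) + df["c"].astype(str)

# With a separator, the two pairs stay distinct.
separated = df["b"].astype(str) + "_" + df["c"].astype(str)

print(plain.tolist())      # ['113', '113']
print(separated.tolist())  # ['11_3', '1_13']
```

This only helps if the separator cannot occur inside the values themselves, so for string columns it is worth picking a character that is known to be absent from the data.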
    
