Random sampling of a pandas DataFrame into disjoint groups

Submitted by 。_饼干妹妹 on 2019-12-05 16:43:14

Use sklearn.model_selection.GroupShuffleSplit to perform the split, so that all rows sharing an ID end up in the same half:

from sklearn.model_selection import GroupShuffleSplit

# Initialize the GroupShuffleSplit.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5)

# Get the indexers for the split.
idx1, idx2 = next(gss.split(df, groups=df.ids))

# Get the split DataFrames.
df1, df2 = df.iloc[idx1], df.iloc[idx2]
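
To check that the two halves really are disjoint by ID, here is a minimal self-contained sketch. The sample DataFrame, the "value" column, and the random_state seed are assumptions added for illustration; only the GroupShuffleSplit usage comes from the answer above.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical sample data: four IDs with two rows each.
df = pd.DataFrame({
    "ids": [1, 1, 2, 2, 3, 3, 4, 4],
    "value": range(8),
})

# random_state is an added assumption so the split is repeatable.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)

# split() yields positional row indices for each side of the split.
idx1, idx2 = next(gss.split(df, groups=df["ids"]))
df1, df2 = df.iloc[idx1], df.iloc[idx2]

# IDs that appear in both halves; this should be empty.
shared_ids = set(df1["ids"]) & set(df2["ids"])
print(shared_ids)
```

Note that test_size=0.5 is applied to the number of groups, not the number of rows, so the halves are only equal in size when the groups are.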

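UPDATE: an alternative that splits on ID parity (even IDs go to df1, odd IDs to df2), assuming the IDs are integers: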

df1 = df.sample(frac=1).loc[df.ids % 2 == 0]
df2 = df.loc[df.index.difference(df1.index)]
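
As a rough illustration of what the parity split produces, the sketch below runs it on a small example. The sample DataFrame, the "value" column, and the random_state seed are assumptions, and the mask is built from the shuffled frame itself rather than from df, which selects the same rows.

```python
import pandas as pd

# Hypothetical sample data: four integer IDs with two rows each.
df = pd.DataFrame({
    "ids": [1, 1, 2, 2, 3, 3, 4, 4],
    "value": range(8),
})

# Shuffle the rows; random_state is an added assumption for repeatability.
shuffled = df.sample(frac=1, random_state=0)

# Keep the rows with an even ID (in shuffled order).
df1 = shuffled.loc[shuffled["ids"] % 2 == 0]

# Everything not in df1, selected by index label (returned in sorted index order).
df2 = df.loc[df.index.difference(df1.index)]

print(sorted(df1["ids"].unique()), sorted(df2["ids"].unique()))
```

With this approach the assignment of IDs to halves is fixed by parity; the shuffle only changes the row order inside df1.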

OLD, incorrect answer (it does not keep rows with the same ID together):

You can first shuffle your DataFrame using sample(frac=1) and then use np.split(), which requires the number of rows to be divisible by 2:

import numpy as np

df1, df2 = np.split(df.sample(frac=1), 2)