Question
I need to randomly separate a data frame into two disjoint sets by the attribute 'ids'. For example, consider the following data frame:
df=
Out[470]:
0 1 2 3 ids
0 17.0 18.0 16.0 15.0 13.0
1 18.0 16.0 15.0 15.0 13.0
2 16.0 15.0 15.0 16.0 13.0
131 12.0 8.0 21.0 19.0 14.0
132 8.0 21.0 19.0 20.0 14.0
133 21.0 19.0 20.0 9.0 14.0
248 NaN NaN 12.0 11.0 17.0
249 NaN 12.0 11.0 10.0 17.0
250 12.0 11.0 10.0 NaN 17.0
287 3.0 3.0 1.0 8.0 20.0
288 3.0 1.0 8.0 3.0 20.0
289 1.0 8.0 3.0 3.0 20.0
413 21.0 7.0 16.0 18.0 25.0
414 7.0 16.0 18.0 19.0 25.0
415 16.0 18.0 19.0 18.0 25.0
665 10.0 8.0 8.0 7.0 27.0
666 8.0 8.0 7.0 9.0 27.0
667 8.0 7.0 9.0 8.0 27.0
790 NaN NaN 15.0 NaN 33.0
791 NaN 15.0 NaN 10.0 33.0
792 15.0 NaN 10.0 NaN 33.0
812 NaN 16.0 NaN 17.0 34.0
813 16.0 NaN 17.0 NaN 34.0
814 NaN 17.0 NaN 13.0 34.0
944 3.0 4.0 3.0 18.0 35.0
945 4.0 3.0 18.0 18.0 35.0
946 3.0 18.0 18.0 11.0 35.0
1059 9.0 10.0 3.0 4.0 56.0
1060 10.0 3.0 4.0 3.0 56.0
1061 3.0 4.0 3.0 3.0 56.0
... ... ... ... ...
10125 NaN 9.0 5.0 5.0 101317.0
10126 9.0 5.0 5.0 5.0 101317.0
10127 5.0 5.0 5.0 7.0 101317.0
I need to get two dataframes (randomly separated, with some fraction size) that have no intersecting values of ids.
I know how to solve it in a 'non-pandasian' way:
- get the unique values of ids
- randomly split the unique values into two disjoint groups
- select rows according to the values of ids in the two groups using .isin()
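The three steps above can be sketched roughly as follows. This is only an illustration, not code from the original post: the toy frame, the `value` column, the `frac` variable, and the fixed `random_state` are all assumptions made for the example.

```python
import pandas as pd

# Toy data standing in for the frame in the question (values are made up).
df = pd.DataFrame({
    "value": range(12),
    "ids": [13, 13, 13, 14, 14, 14, 17, 17, 17, 20, 20, 20],
})

frac = 0.5  # hypothetical share of the unique ids that goes to the first frame

# 1. Get the unique values of ids.
unique_ids = df["ids"].drop_duplicates()

# 2. Randomly pick a fraction of the unique ids; the remaining ids form the
#    second group. random_state is fixed only to make the sketch repeatable.
first_ids = unique_ids.sample(frac=frac, random_state=0)

# 3. Select rows by group membership with .isin().
mask = df["ids"].isin(first_ids)
df1, df2 = df[mask], df[~mask]
```

Because the sampling is done on the unique ids rather than on the rows, every id ends up entirely in `df1` or entirely in `df2`; `frac` therefore controls the share of ids, not the share of rows.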
I am wondering whether there is a simple and neat way to do it with some pandas built-in function, like .sample()?
Answer 1:
Use sklearn.model_selection.GroupShuffleSplit to perform the split:
from sklearn.model_selection import GroupShuffleSplit
# Initialize the GroupShuffleSplit.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5)
# Get the indexers for the split.
idx1, idx2 = next(gss.split(df, groups=df.ids))
# Get the split DataFrames.
df1, df2 = df.iloc[idx1], df.iloc[idx2]
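As a quick sanity check, the same split can be run on a small made-up frame to confirm that no id lands in both parts. The toy data, the `value` column, and the `random_state` value are assumptions added for this sketch only.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: four ids with three rows each (values are made up).
df = pd.DataFrame({
    "value": range(12),
    "ids": [13] * 3 + [14] * 3 + [17] * 3 + [20] * 3,
})

# random_state is set only so the sketch is repeatable.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
idx1, idx2 = next(gss.split(df, groups=df["ids"]))
df1, df2 = df.iloc[idx1], df.iloc[idx2]

# Ids that appear in both parts; this should be empty.
shared = set(df1["ids"]) & set(df2["ids"])
print(shared)  # set()
```

Note that test_size here is the proportion of groups (distinct ids) placed in the second part, so the row counts of the two frames match the requested fraction only when the groups are of similar size.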
Answer 2:
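UPDATE: split on the parity of ids (df1 gets the rows with even ids, in shuffled order; df2 gets the remaining rows):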
df1 = df.sample(frac=1).loc[df.ids % 2 == 0]
df2 = df.loc[df.index.difference(df1.index)]
OLD, incorrect answer (it does not keep the IDs separate):
You can first shuffle your DF using sample(frac=1) and then use np.split():
df1, df2 = np.split(df.sample(frac=1), 2)
Source: https://stackoverflow.com/questions/44007496/random-sampling-with-pandas-data-frame-disjoint-groups