Question
My question is: how can I transform a DataFrame like this so that I can eventually use it with scikit-learn's MultiLabelBinarizer?
import pandas as pd
d1 = {'ID':[1,2,2,4], 'km':[80,90,90,100], 'weight':[10,20,20,30], 'label':['A','B','C','D']}
df1 = pd.DataFrame(data=d1)
df1
   ID   km  weight label
0   1   80      10     A
1   2   90      20     B
2   2   90      20     C
3   4  100      30     D
It should turn out like this:
d2 ={'km':[80,90,100], 'weight':[10,20,30], 'label':['A',('B','C'),'D']}
df2 = pd.DataFrame(data=d2)
df2
    km  weight   label
0   80      10       A
1   90      20  (B, C)
2  100      30       D
So that I can use the data properly in the MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(df2['label'])
mlb.transform(df2['label'])
array([[1, 0, 0, 0],
       [0, 1, 1, 0],
       [0, 0, 0, 1]])
Note: the raw data has more than 1 million rows.
Answer 1:
I think you need this:
import pandas as pd
d1 = {'ID':[1,2,3,4], 'km':[80,90,90,100], 'weight':[10,20,20,30], 'label':['A','B','C','D']}
df1 = pd.DataFrame(data=d1)
# Group rows that share km and weight, collecting each group's labels into a tuple
df2 = df1.groupby(['km','weight'])['label'].apply(tuple).reset_index()
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(df2['label'])
mlb.transform(df2['label'])
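For a frame with more than a million rows, a dense indicator array may take a lot of memory. The sketch below shows one possible variation on the approach above: it uses the `sparse_output=True` option of MultiLabelBinarizer and reads the column order from `mlb.classes_`. The names `grouped`, `encoded`, `indicator`, and `result` are illustrative, and the dense conversion at the end is only there to inspect the small sample; it is an assumption that you would skip it on the full data.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data from the question (illustrative only).
df1 = pd.DataFrame({
    'ID': [1, 2, 2, 4],
    'km': [80, 90, 90, 100],
    'weight': [10, 20, 20, 30],
    'label': ['A', 'B', 'C', 'D'],
})

# Collapse rows that share km and weight into one row with a tuple of labels.
grouped = df1.groupby(['km', 'weight'])['label'].apply(tuple).reset_index()

# sparse_output=True makes fit_transform return a SciPy sparse matrix
# instead of a dense NumPy array.
mlb = MultiLabelBinarizer(sparse_output=True)
encoded = mlb.fit_transform(grouped['label'])

# Dense view for inspection on the small sample only; mlb.classes_ holds
# the label that each indicator column stands for.
indicator = pd.DataFrame(encoded.toarray(), columns=mlb.classes_, index=grouped.index)
result = pd.concat([grouped[['km', 'weight']], indicator], axis=1)
print(result)
```

On the sample data the indicator rows are [1, 0, 0, 0], [0, 1, 1, 0], and [0, 0, 0, 1], the same array the question expects. Wrapping every group in a tuple, including single-label groups such as ('A',), keeps the input uniform, so a label longer than one character would not be split into individual characters.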
Source: https://stackoverflow.com/questions/53494873/transform-pandas-data-frame-to-use-for-multilabelbinarizer