Question
I want to create one-hot encoded data from the categorical data shown here:
           Label1      Label2    Label3
0  Street fashion    Clothing   Fashion
1        Clothing   Outerwear     Jeans
2    Architecture    Property  Clothing
3        Clothing       Black  Footwear
4           White  Photograph    Beauty
The problem (for me) is that one specific label (e.g. Clothing) can appear in Label1, Label2 or Label3. I tried pd.get_dummies, but this created one dummy column per source column, like:
   Label1_Clothing  Label2_Clothing  Label3_Clothing
0                0                1                0
1                1                0                0
2                0                0                1
Is there a way to only have one dummy variable column for each label? So rather:
   Label_Clothing  Label_Street fashion  Label_Architecture
0               1                     1                   0
1               1                     0                   0
2               1                     0                   1
I am pretty new to programming and would be very grateful for your help.
Best, Bernardo
Answer 1:
You can stack your dataframe into a single Series, then get the dummies from that. From there, you take the maximum over the outer index level to collapse the data back to one row per original row, so each label ends up in a single column regardless of which source column it came from:
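import pandas as pd
# groupby(level=0).max() replaces max(level=0), which was removed in pandas 2.0
dummies = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).max()
print(dummies)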
   Architecture  Beauty  Black  Clothing  Fashion  Footwear  Jeans  Outerwear  Photograph  Property  Street fashion  White
0             0       0      0         1        1         0      0          0           0         0               1      0
1             0       0      0         1        0         0      1          1           0         0               0      0
2             1       0      0         1        0         0      0          0           0         1               0      0
3             0       0      1         1        0         1      0          0           0         0               0      0
4             0       1      0         0        0         0      0          0           1         0               0      1
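For anyone who wants to run this end to end, here is a minimal sketch of the same stack / get_dummies / max-per-row idea, assuming a reasonably recent pandas. The DataFrame is rebuilt by hand from the sample in the question. The intermediate names `stacked` and `per_cell` and the final `add_prefix("Label_")` step are illustrative additions, included only to get the `Label_...` column names the question asks for.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Label1": ["Street fashion", "Clothing", "Architecture", "Clothing", "White"],
    "Label2": ["Clothing", "Outerwear", "Property", "Black", "Photograph"],
    "Label3": ["Fashion", "Jeans", "Clothing", "Footwear", "Beauty"],
})

# 1) Stack: one long Series indexed by (original row, source column).
stacked = df.stack()

# 2) One dummy column per distinct label, one row per original cell.
per_cell = pd.get_dummies(stacked, dtype=int)

# 3) Collapse back to one row per original row: a label is 1 if it
#    appeared in any of that row's cells.
dummies = per_cell.groupby(level=0).max()

# Optional: prefix the column names (hypothetical naming choice).
dummies = dummies.add_prefix("Label_")

print(dummies)
```

Taking the maximum rather than the sum keeps the values at 0/1 even if the same label were to appear twice in one row.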
Source: https://stackoverflow.com/questions/64667044/one-hot-encoding-for-words-which-occur-in-multiple-columns