One-hot encoding for words which occur in multiple columns

Question

I want to create one-hot encoded data from the categorical data shown here.

        Label1          Label2        Label3  
0   Street fashion        Clothing       Fashion
1         Clothing       Outerwear         Jeans
2     Architecture        Property      Clothing
3         Clothing           Black      Footwear
4            White      Photograph        Beauty

The problem (for me) is that a given label (e.g. Clothing) can appear in Label1, Label2, or Label3. I tried pd.get_dummies, but it creates a separate dummy column for each source column:

   Label1_Clothing  Label2_Clothing  Label3_Clothing
0                0                1                0
1                1                0                0
2                0                0                1
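
For reference, a minimal sketch that reproduces this behaviour. It assumes the sample table above is held in a DataFrame named df; that name and the dtype=int argument (used only to print 0/1 instead of booleans on newer pandas) are assumptions, not part of the original post.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Label1": ["Street fashion", "Clothing", "Architecture", "Clothing", "White"],
    "Label2": ["Clothing", "Outerwear", "Property", "Black", "Photograph"],
    "Label3": ["Fashion", "Jeans", "Clothing", "Footwear", "Beauty"],
})

# Calling get_dummies on the whole frame encodes each column separately,
# prefixing every dummy with the name of the column it came from
per_column = pd.get_dummies(df, dtype=int)

# The same label therefore ends up in up to three different columns
clothing_cols = [c for c in per_column.columns if c.endswith("_Clothing")]
print(per_column[clothing_cols].head(3))
```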

Is there a way to have only one dummy column per label, like this?

   Label_Clothing  Label_Street Fashion  Label_Architecture
0               1                     1                   0
1               1                     0                   0
2               1                     0                   1

I am pretty new to programming and would be very glad for your help.

Best, Bernardo

Answer 1:

You can stack your DataFrame into a single Series and get the dummies from that. Taking the maximum over the outer index level (the original row number) then collapses the result back to one row per original row, so each label gets a single column no matter which of the three columns it appeared in:

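# .max(level=0) was removed in pandas 2.0; group by the outer index level instead.
# dtype=int keeps the 0/1 output shown below (newer pandas defaults to booleans).
dummies = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).max()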

print(dummies)
   Architecture  Beauty  Black  Clothing  Fashion  Footwear  Jeans  Outerwear  Photograph  Property  Street fashion  White
0             0       0      0         1        1         0      0          0           0         0               1      0
1             0       0      0         1        0         0      1          1           0         0               0      0
2             1       0      0         1        0         0      0          0           0         1               0      0
3             0       0      1         1        0         1      0          0           0         0               0      0
4             0       1      0         0        0         0      0          0           1         0               0      1
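
An end-to-end sketch of the stacking approach, for anyone who wants to run it directly. The DataFrame is rebuilt from the table in the question. The variable names and the add_prefix step (used here only to get the Label_ column names asked for in the question) are assumptions rather than part of the original answer.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Label1": ["Street fashion", "Clothing", "Architecture", "Clothing", "White"],
    "Label2": ["Clothing", "Outerwear", "Property", "Black", "Photograph"],
    "Label3": ["Fashion", "Jeans", "Clothing", "Footwear", "Beauty"],
})

# stack() turns the frame into one long Series indexed by (row, column name)
stacked = df.stack()

# One dummy column per distinct label; still one row per (row, column name) pair
long_dummies = pd.get_dummies(stacked, dtype=int)

# Collapse back to one row per original row: a label is 1 if it occurred
# in any of that row's columns
dummies = long_dummies.groupby(level=0).max()

# Optional: add the "Label_" prefix requested in the question
dummies = dummies.add_prefix("Label_")

print(dummies[["Label_Clothing", "Label_Street fashion", "Label_Architecture"]].head(3))
```

Using max rather than sum keeps the values at 0/1 even if the same label were to appear twice in one row.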

Source: https://stackoverflow.com/questions/64667044/one-hot-encoding-for-words-which-occur-in-multiple-columns

Tags

python, pandas, machine-learning, one-hot-encoding, dummy-variable
