Decode one-hot dataframe in Pandas

Question

I have 2 dataframes with the data as below:

df1:
====
id   name   age   likes
---  -----  ----  -----
0     A      21    rose
1     B      22    apple
2     C      30    grapes
4     D      21    lily

df2:
====
category    Fruit   Flower 
---------  -------  -------
orange      1        0
apple       1        0       
rose        0        1
lily        0        1
grapes      1        0

What I am trying to do is add another column to df1 which would contain the word 'Fruit' or 'Flower' depending on the one-hot encoding in df2 for that entry. I am looking for a purely pandas/numpy implementation.

Any help would be appreciated.

Thanks!

Answer 1:

If I understand correctly, you can merge the two dataframes and then use .apply with axis=1 (or axis="columns"), which applies the function to each row.

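# A left merge keeps one row per df1 row, in df1's order
# (this assumes each category appears only once in df2).
df3 = df1.merge(df2, how='left', left_on='likes', right_on='category')

# List your one-hot columns here.
categories_col = ['Fruit', 'Flower']

def get_category(x):
    # Return the name of the first one-hot column set to 1 (None if there is none).
    for category in categories_col:
        if x[category] == 1:
            return category

# Assign by position, since df3 has a fresh 0..n-1 index.
df1["new"] = df3.apply(get_category, axis=1).to_numpy()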

print(df1)
    id  name    age likes   new
0   0   A   21  rose    Flower
1   1   B   22  apple   Fruit
2   2   C   30  grapes  Fruit  
3   4   D   21  lily    Flower

Make sure the columns listed in categories_col really are one-hot encoded, i.e. exactly one of them is 1 in each row.
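The row-wise function can also be avoided. The following is a minimal sketch, assuming each category appears only once in df2 and each row has exactly one 1; it decodes df2 with idxmax and looks the result up with map. The variable name decoded is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [0, 1, 2, 4],
    "name": ["A", "B", "C", "D"],
    "age": [21, 22, 30, 21],
    "likes": ["rose", "apple", "grapes", "lily"],
})
df2 = pd.DataFrame({
    "category": ["orange", "apple", "rose", "lily", "grapes"],
    "Fruit": [1, 1, 0, 0, 1],
    "Flower": [0, 0, 1, 1, 0],
})

# For each category, idxmax(axis=1) returns the name of the column holding the 1.
decoded = df2.set_index("category")[["Fruit", "Flower"]].idxmax(axis=1)

# Look up each value of "likes" in the decoded Series; unmatched values become NaN.
df1["new"] = df1["likes"].map(decoded)
print(df1)
```

Because the lookup is keyed on the category name, the row order and row count of df2 do not matter.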

Answer 2:

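You can use apply() for that, as long as the result is looked up by category rather than by row position:

df1['type_string'] = df1['likes'].map(df2.set_index('category').apply(lambda x: 'Fruit' if x.Fruit else 'Flower', axis=1))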

Here is a running example:

import pandas as pd
from io import StringIO

df1 = pd.read_csv(StringIO(
"""
0     A      21    rose
1     B      22    apple
2     C      30    grapes
4     D      21    lily
"""), sep=r'\s+', header=None)

df2 = pd.read_csv(StringIO(
"""
orange      1        0
apple       1        0       
rose        0        1
lily        0        1
grapes      1        0
"""), sep=r'\s+', header=None)

df1.columns = ['id', 'name', 'age', 'likes']
df2.columns = ['category', 'Fruit', 'Flower']

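# Decode each row of df2 into a label, indexed by category.
labels = df2.set_index('category').apply(lambda x: 'Fruit' if x.Fruit else 'Flower', axis=1)
# Look up each value of df1.likes. Assigning the result of df2.apply directly
# would pair the rows by position instead of by category.
df1['category'] = df1['likes'].map(labels)

Input

   id name  age   likes
0   0    A   21    rose
1   1    B   22   apple
2   2    C   30  grapes
3   4    D   21    lily

Output

   id name  age   likes category
0   0    A   21    rose   Flower
1   1    B   22   apple    Fruit
2   2    C   30  grapes    Fruit
3   4    D   21    lily   Flower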
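Since the question asks for a pandas/numpy implementation, here is a sketch of the same decoding done with numpy's argmax instead of a Python-level lambda. It assumes each category is unique in df2 and every row has exactly one 1; one_hot_cols, positions and labels are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "id": [0, 1, 2, 4],
    "name": ["A", "B", "C", "D"],
    "age": [21, 22, 30, 21],
    "likes": ["rose", "apple", "grapes", "lily"],
})
df2 = pd.DataFrame({
    "category": ["orange", "apple", "rose", "lily", "grapes"],
    "Fruit": [1, 1, 0, 0, 1],
    "Flower": [0, 0, 1, 1, 0],
})

one_hot_cols = ["Fruit", "Flower"]

# argmax along axis 1 gives the position of the 1 in each one-hot row.
positions = df2[one_hot_cols].to_numpy().argmax(axis=1)

# Translate positions into column names, keyed by category.
labels = pd.Series(np.array(one_hot_cols)[positions], index=df2["category"])

# Reorder the labels to follow df1.likes, then assign by position.
df1["category"] = labels.reindex(df1["likes"]).to_numpy()
print(df1)
```

Note that argmax returns 0 for an all-zero row, so a row without any 1 would silently be labelled with the first column.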

Answer 3:

The trick lies in the fact that the two tables have different numbers of rows. The examples above might also fail if df2 has more categories than appear in df1.

Here's a working example:

import pandas as pd

df1 = pd.DataFrame([['orange',12],['rose',3],['apple',44],['grapes',1]], columns = ['name', 'age'])


df1
    name    age
0   orange  12
1   rose    3
2   apple   44
3   grapes  1

df2 = pd.DataFrame([['orange',1],['rose',0],['apple',1],['grapes',1],['daffodils',0],['berries',1]], columns = ['cat', 'Fruit'])

df2
    cat         Fruit
0   orange      1
1   rose        0
2   apple       1
3   grapes      1
4   daffodils   0
5   berries     1

In a single line: left-merge df1 and df2 on the fly with the key df1.name = df2.cat, then run a list comprehension with a conditional over the merged Fruit column:

df1['flag'] = ['Fruit' if i == 1 else 'Flower' for i in df1.merge(df2,how='left',left_on='name', right_on='cat').Fruit]
df1

Output

     name  age    flag
0  orange   12   Fruit
1    rose    3  Flower
2   apple   44   Fruit
3  grapes    1   Fruit
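One caveat with the list comprehension: a name in df1 that has no match in df2 comes out of the left merge as NaN and is then labelled 'Flower'. The sketch below keeps such rows as missing instead, assuming the cat values in df2 are unique. The 'tulip' row is a hypothetical entry added here purely to illustrate the unmatched case.

```python
import pandas as pd

# 'tulip' is a made-up entry with no counterpart in df2.
df1 = pd.DataFrame([["orange", 12], ["rose", 3], ["tulip", 7]], columns=["name", "age"])
df2 = pd.DataFrame([["orange", 1], ["rose", 0], ["apple", 1]], columns=["cat", "Fruit"])

# Turn the 0/1 flag into a label, keyed by category name.
labels = df2.set_index("cat")["Fruit"].map({1: "Fruit", 0: "Flower"})

# Look up each name; names missing from df2 stay NaN rather than becoming 'Flower'.
df1["flag"] = df1["name"].map(labels)
print(df1)
```

The missing values can then be handled explicitly, for example with fillna, if a default label is wanted.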

Source: https://stackoverflow.com/questions/53078951/decode-one-hot-dataframe-in-pandas

Tags

pandas