Python pandas groupby conditional concatenate strings into multiple columns

Question

I am trying to group a DataFrame by one column, keep several columns from one row in each group, and concatenate strings from the other rows into multiple columns based on the value of one column. Here is an example:

import pandas as pd

df = pd.DataFrame({'test' : ['a','a','a','a','a','a','b','b','b','b'],
     'name' : ['aa','ab','ac','ad','ae','ba','bb','bc','bd','be'],
     'amount' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 9.5],
     'role' : ['x','y','y','x','x','z','y','y','z','y']})

      amount    name    role    test
0        1.0    aa      x       a
1        2.0    ab      y       a
2        3.0    ac      y       a
3        4.0    ad      x       a
4        5.0    ae      x       a
5        6.0    ba      z       a
6        7.0    bb      y       b
7        8.0    bc      y       b
8        9.0    bd      z       b
9        9.5    be      y       b

I would like to group by test, retain name and amount from the row where role = 'z', create a column (let's call it X) that concatenates the values of name where role = 'x', and create another column (let's call it Y) that concatenates the values of name where role = 'y'. Concatenated values are separated by '; '. For each value of test there is exactly one row with role = 'z', and zero to many rows each with role = 'x' and role = 'y'. X and Y can be null if a test has no rows for that role. The amount value is dropped for all rows with role = 'x' or 'y'. The desired output would be something like:

     test   name     amount        X              Y
0    a      ba          6.0        aa; ad; ae     ab; ac
1    b      bd          9.0        None           bb; bc; be

For the concatenating part, I found x.ix[x.role == 'x', X] = "{%s}" % '; '.join(x['name']), which I might be able to repeat for y. I tried a few things along the lines of name = x[x.role == 'z'].name.first() for name and amount. I also tried going down both paths of a defined function and a lambda function without success. Appreciate any thoughts.

Answer 1:

You can build custom columns inside the function passed to apply after groupby. Here g can be thought of as a sub-DataFrame holding the rows for a single value of the test column. Because you want multiple columns returned, return a Series for each group whose index labels are the corresponding column headers in the result:

df.groupby('test').apply(lambda g: pd.Series({'name': g['name'][g.role == 'z'].iloc[0],
                                              'amount': g['amount'][g.role == 'z'].iloc[0], 
                                              'X': '; '.join(g['name'][g.role == 'x']), 
                                              'Y': '; '.join(g['name'][g.role == 'y'])
                                             })).reset_index()
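One detail worth noting: '; '.join over an empty selection returns an empty string, so a group with no role = 'x' rows gets '' in X rather than a null. A minimal self-contained sketch of the same idea that returns None in that case is shown below. The helper names summarize and joined are hypothetical, and selecting the three columns before apply is just one way to keep the grouping column out of each sub-frame.

```python
import pandas as pd

df = pd.DataFrame({
    'test': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'name': ['aa', 'ab', 'ac', 'ad', 'ae', 'ba', 'bb', 'bc', 'bd', 'be'],
    'amount': [1, 2, 3, 4, 5, 6, 7, 8, 9, 9.5],
    'role': ['x', 'y', 'y', 'x', 'x', 'z', 'y', 'y', 'z', 'y'],
})

def summarize(g):
    """Collapse one group into a single row (hypothetical helper)."""
    # Assumes exactly one row with role == 'z' per group, as stated in the question.
    z_row = g[g['role'] == 'z'].iloc[0]

    def joined(role):
        # Join the names for one role; return None when the role is absent.
        names = g.loc[g['role'] == role, 'name']
        return '; '.join(names) if len(names) else None

    return pd.Series({
        'name': z_row['name'],
        'amount': z_row['amount'],
        'X': joined('x'),
        'Y': joined('y'),
    })

result = (df.groupby('test')[['name', 'amount', 'role']]
            .apply(summarize)
            .reset_index())
print(result)
```

If a group could lack a role = 'z' row, the iloc[0] lookup would raise an IndexError, so that case would need its own guard.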

Answer 2:

# set the index and take the cross-section where role is 'z'
z = df.set_index(['test', 'role']).xs('z', level='role')
# drop the 'z' rows, then group by 'test' and 'role' to join the names with '; '
xy = df.query('role != "z"').groupby(['test', 'role'])['name'].apply('; '.join).unstack()
# make columns of xy upper case
xy.columns = xy.columns.str.upper()

pd.concat([z, xy], axis=1).reset_index()
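For reference, here is a self-contained sketch of the same split-and-recombine approach, using the example data from the question. It swaps in boolean masks for query and xs, and adds a reindex step (an addition, not part of the answer above) so that both X and Y columns exist even when no row in the whole frame has that role. Missing combinations come out as NaN rather than None.

```python
import pandas as pd

df = pd.DataFrame({
    'test': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'name': ['aa', 'ab', 'ac', 'ad', 'ae', 'ba', 'bb', 'bc', 'bd', 'be'],
    'amount': [1, 2, 3, 4, 5, 6, 7, 8, 9, 9.5],
    'role': ['x', 'y', 'y', 'x', 'x', 'z', 'y', 'y', 'z', 'y'],
})

# Rows with role 'z' supply name and amount, indexed by test.
z = df[df['role'] == 'z'].set_index('test')[['name', 'amount']]

# Remaining rows: join names per (test, role), then pivot role into columns.
xy = (df[df['role'] != 'z']
      .groupby(['test', 'role'])['name']
      .agg('; '.join)
      .unstack()
      .reindex(columns=['x', 'y'])   # guarantee both columns are present
      .rename(columns=str.upper))    # x -> X, y -> Y

# Align the two pieces on the shared 'test' index.
result = z.join(xy).reset_index()
print(result)
```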

Source: https://stackoverflow.com/questions/40519697/python-pandas-groupby-conditional-concatenate-strings-into-multiple-columns

Tags

python

pandas

group-by

conditional

string-concatenation