How to groupby and pivot a dataframe with non-numeric values

问题

I'm using Python, and I have a dataset of 6 columns, R, Rc, J, T, Ca and Cb. I need to "aggregate" on the columns "R" then "J", so that for each R, each row is a unique "J". Rc is a characteristic of R. Ca and Cb are characteristics of T. It will make more sense looking at the table below.

I need to go from:

#______________________            ________________________________________________________________
#| R  Rc  J  T  Ca  Cb|           |# R  Rc  J  Ca(T=1)  Ca(T=2)  Ca(T=3)  Cb(T=1)  Cb(T=2)  Cb(T=3)|
#| a   p  1  1  x    d|           |# a  p   1    x         y        z        d        e        f   |
#| a   p  1  2  y    e|           |# b  o   1    w                           g                     |  
#| a   p  1  3  z    f|  ----->   |# b  o   2    v                           h                     | 
#| b   o  1  1  w    g|           |# b  o   3    s                           i                     |
#| b   o  2  1  v    h|           |# c  n   1    t         r                 j        k            |
#| b   o  3  1  s    i|           |# c  n   2    u                           l                     |
#| c   n  1  1  t    j|           |________________________________________________________________|
#| c   n  1  2  r    k|           
#| c   n  2  1  u    l|
#|____________________|

data = {'R' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], 
        'Rc': ['p', 'p', 'p', 'o', 'o', 'o', 'n', 'n', 'n'],
        'J' : [1, 1, 1, 1, 2, 3, 1, 1, 2], 
        'T' : [1, 2, 3, 1, 1, 1, 1, 2, 1], 
        'Ca': ['x', 'y', 'z', 'w', 'v', 's', 't', 'r', 'u'],
        'Cb': ['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']}

df = pd.DataFrame(data=data)

I don't want to lose the data in Rc, Ca, or Cb.

Rc (or each column that ends in 'c') is the same for each R, so that can just be grouped with R.

But Ca and Cb (or each column that starts with 'C') are unique for each T, which will be aggregated and otherwise lost. These need to instead be saved in new columns named Ca(T=1) for when T=1, Ca(T=2) for when T=2, and Ca(T=3) for when T=3. The same goes for Cb.

So using T, I need to create T number of columns for each Ca and Cb given T, that writes the data from Ca and Cb into the new columns.

PS. If it helps, columns J and T both have an extra column with unique IDs.

J_ID = [1,1,1,2,3,4,5,5,6]
T_ID = [1,2,3,4,5,6,7,8,9]

What I tried so far:

(
    df.groupby(['R','J'])
    .apply(lambda x: x.Ca.tolist()).apply(pd.Series)
    .rename(columns=lambda x: f'Ca{x+1}')
    .reset_index()
)

Problem: Only possible to do with one of the C's and I lose Rc.

Any help would be greatly appreciated!

回答1:

You can use pivot_table (here the docs) with a lambda function as aggfunc argument:

table = pd.pivot_table(df, index = ['R','Rc','J'],values = ['Ca','Cb'],
                    columns = ['T'], fill_value = '', aggfunc = lambda x: ''.join(str(v) for v in x)).reset_index()


   R Rc  J Ca       Cb      
T           1  2  3  1  2  3
0  a  p  1  x  y  z  d  e  f
1  b  o  1  w        g      
2  b  o  2  v        h      
3  b  o  3  s        i      
4  c  n  1  t  r     j  k   
5  c  n  2  u        l

Then you can remove the multiindex columns and rename as follow (taken from this great answer):

table.columns = ['%s%s' % (a, ' (T = %s)' % b if b else '') for a, b in table.columns]

   R Rc  J Ca (T = 1) Ca (T = 2) Ca (T = 3) Cb (T = 1) Cb (T = 2) Cb (T = 3)
0  a  p  1          x          y          z          d          e          f
1  b  o  1          w                                g                      
2  b  o  2          v                                h                      
3  b  o  3          s                                i                      
4  c  n  1          t          r                     j          k           
5  c  n  2          u                                l