How to groupby and pivot a dataframe with non-numeric values

Submitted by 无人久伴 on 2020-05-29 11:38:10

Question


I'm using Python, and I have a dataset with 6 columns: R, Rc, J, T, Ca and Cb. I need to "aggregate" on column "R" and then "J", so that for each R, each row is a unique "J". Rc is a characteristic of R. Ca and Cb are characteristics of T. It will make more sense looking at the table below.

I need to go from:

#______________________            ________________________________________________________________
#| R  Rc  J  T  Ca  Cb|           |# R  Rc  J  Ca(T=1)  Ca(T=2)  Ca(T=3)  Cb(T=1)  Cb(T=2)  Cb(T=3)|
#| a   p  1  1  x    d|           |# a  p   1    x         y        z        d        e        f   |
#| a   p  1  2  y    e|           |# b  o   1    w                           g                     |  
#| a   p  1  3  z    f|  ----->   |# b  o   2    v                           h                     | 
#| b   o  1  1  w    g|           |# b  o   3    s                           i                     |
#| b   o  2  1  v    h|           |# c  n   1    t         r                 j        k            |
#| b   o  3  1  s    i|           |# c  n   2    u                           l                     |
#| c   n  1  1  t    j|           |________________________________________________________________|
#| c   n  1  2  r    k|           
#| c   n  2  1  u    l|
#|____________________|

import pandas as pd

data = {'R' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
        'Rc': ['p', 'p', 'p', 'o', 'o', 'o', 'n', 'n', 'n'],
        'J' : [1, 1, 1, 1, 2, 3, 1, 1, 2], 
        'T' : [1, 2, 3, 1, 1, 1, 1, 2, 1], 
        'Ca': ['x', 'y', 'z', 'w', 'v', 's', 't', 'r', 'u'],
        'Cb': ['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']}

df = pd.DataFrame(data=data)

I don't want to lose the data in Rc, Ca, or Cb.

Rc (or each column that ends in 'c') is the same for each R, so that can just be grouped with R.

But Ca and Cb (or each column that starts with 'C') are unique for each T, so they would be lost once T is aggregated away. They need to be saved in new columns instead, named Ca(T=1) for when T=1, Ca(T=2) for when T=2, and Ca(T=3) for when T=3. The same goes for Cb.

So for each distinct value of T, I need to create one new column for Ca and one for Cb, and write the data from Ca and Cb into those new columns.

PS. If it helps, columns J and T both have an extra column with unique IDs.

J_ID = [1,1,1,2,3,4,5,5,6]
T_ID = [1,2,3,4,5,6,7,8,9]

What I tried so far:

(
    df.groupby(['R','J'])
    .apply(lambda x: x.Ca.tolist()).apply(pd.Series)
    .rename(columns=lambda x: f'Ca{x+1}')
    .reset_index()
)

Problem: this only works for one of the C columns at a time, and I lose Rc.

Any help would be greatly appreciated!


Answer 1:


You can use pivot_table (see the pandas docs) with a lambda function as the aggfunc argument:

table = pd.pivot_table(
    df,
    index=['R', 'Rc', 'J'],
    values=['Ca', 'Cb'],
    columns=['T'],
    fill_value='',
    aggfunc=lambda x: ''.join(str(v) for v in x),
).reset_index()


   R Rc  J Ca       Cb      
T           1  2  3  1  2  3
0  a  p  1  x  y  z  d  e  f
1  b  o  1  w        g      
2  b  o  2  v        h      
3  b  o  3  s        i      
4  c  n  1  t  r     j  k   
5  c  n  2  u        l      

Then you can flatten the MultiIndex columns and rename them as follows (taken from another answer):

table.columns = ['%s%s' % (a, ' (T = %s)' % b if b else '') for a, b in table.columns]

   R Rc  J Ca (T = 1) Ca (T = 2) Ca (T = 3) Cb (T = 1) Cb (T = 2) Cb (T = 3)
0  a  p  1          x          y          z          d          e          f
1  b  o  1          w                                g                      
2  b  o  2          v                                h                      
3  b  o  3          s                                i                      
4  c  n  1          t          r                     j          k           
5  c  n  2          u                                l                      
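
As an alternative to the lambda aggfunc, the same reshaping can be sketched with set_index and unstack. This is only a sketch and it assumes that every (R, Rc, J, T) combination appears at most once, which holds for the sample data but is an assumption about the real dataset (unstack raises an error when the index contains duplicates). The variable name wide and the exact label format Ca(T=1) are chosen here for illustration.

```python
import pandas as pd

data = {'R' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
        'Rc': ['p', 'p', 'p', 'o', 'o', 'o', 'n', 'n', 'n'],
        'J' : [1, 1, 1, 1, 2, 3, 1, 1, 2],
        'T' : [1, 2, 3, 1, 1, 1, 1, 2, 1],
        'Ca': ['x', 'y', 'z', 'w', 'v', 's', 't', 'r', 'u'],
        'Cb': ['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']}
df = pd.DataFrame(data)

# Move the grouping keys and T into the index, then spread T across columns.
# Assumption: each (R, Rc, J, T) combination is unique.
wide = df.set_index(['R', 'Rc', 'J', 'T'])[['Ca', 'Cb']].unstack('T')

# Missing (J, T) combinations come out as NaN; show them as empty strings.
wide = wide.fillna('')

# Flatten the two-level column index into labels such as 'Ca(T=1)'.
wide.columns = [f'{name}(T={t})' for name, t in wide.columns]
wide = wide.reset_index()

print(wide)
```

Because no aggregation function is involved, this version fails loudly on duplicate keys instead of silently concatenating values, which may or may not be what you want.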



Answer 2:


If I understand what you need, you can simply select the relevant rows like this:

df['Ca(T=1)'] = df['Ca'].loc[df['T'] == 1]

You have to repeat this for each value of T.
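
The repetition can be sketched as a loop over the C columns and the distinct values of T. Note that the assignment on its own keeps one row per original row, with NaN wherever T does not match, so this sketch adds a groupby step (not part of the answer above) to collapse the rows to one per (R, Rc, J). The names df_wide and result are hypothetical, and the sketch assumes at most one value per (R, Rc, J, T) combination.

```python
import pandas as pd

data = {'R' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
        'Rc': ['p', 'p', 'p', 'o', 'o', 'o', 'n', 'n', 'n'],
        'J' : [1, 1, 1, 1, 2, 3, 1, 1, 2],
        'T' : [1, 2, 3, 1, 1, 1, 1, 2, 1],
        'Ca': ['x', 'y', 'z', 'w', 'v', 's', 't', 'r', 'u'],
        'Cb': ['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']}
df = pd.DataFrame(data)

df_wide = df.copy()
for col in ['Ca', 'Cb']:
    for t in sorted(df['T'].unique()):
        # Keep the value where T matches; everything else becomes NaN.
        df_wide[f'{col}(T={t})'] = df[col].where(df['T'] == t)

# Collapse to one row per (R, Rc, J): first() takes the first non-null
# value in each group, so the NaN placeholders are skipped.
result = (
    df_wide.drop(columns=['T', 'Ca', 'Cb'])
    .groupby(['R', 'Rc', 'J'], as_index=False)
    .first()
    .fillna('')
)

print(result)
```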



Source: https://stackoverflow.com/questions/61318800/how-to-groupby-and-pivot-a-dataframe-with-non-numeric-values
