How to aggregate only the numerical columns in a mixed dtypes dataframe

Submitted by 爷,独闯天下 on 2021-01-27 18:43:45

Question


I have a mixed pd.DataFrame:

import pandas as pd
import numpy as np
df = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Timestamp('20180101'),
                     'D' : np.random.rand(10),
                     'F' : 'foo' })

df
Out[12]: 
     A          B          C         D    F
0  1.0 2013-01-02 2018-01-01  0.592533  foo
1  1.0 2013-01-02 2018-01-01  0.819248  foo
2  1.0 2013-01-02 2018-01-01  0.298035  foo
3  1.0 2013-01-02 2018-01-01  0.330128  foo
4  1.0 2013-01-02 2018-01-01  0.371705  foo
5  1.0 2013-01-02 2018-01-01  0.541246  foo
6  1.0 2013-01-02 2018-01-01  0.976108  foo
7  1.0 2013-01-02 2018-01-01  0.423069  foo
8  1.0 2013-01-02 2018-01-01  0.863764  foo
9  1.0 2013-01-02 2018-01-01  0.037085  foo

I would like to aggregate my numerical columns but also keep the non-numerical ones. If I do a groupby followed by agg, I get:

df.groupby('B').agg(np.median)
Out[13]: 
              A         D
B                        
2013-01-02  1.0  0.482157

This is fine, and I know it is the desired behavior, since the other dtypes would probably raise exceptions inside np.median. However, I would also like to keep my original column F with the value foo, as well as C with 2018-01-01.

So far, I have solved this with a custom wrapper around my numerical aggregation functions, e.g. if I wanted to do a nanmedian over my dataframe:

def my_nan_median(x):
    if isinstance(x.values[0], np.datetime64):
        return np.min(x)  # datetimes: pass the earliest value through
    elif isinstance(x.values[0], str):
        return x.values[0]  # strings: pass the first value through
    else:
        return np.nanmedian(x)  # numbers: NaN-aware median

but it looks awful. What is the right way to do so?


Answer 1:


By using select_dtypes:

df.groupby(list(df.select_dtypes(exclude=[np.number]))).agg(np.median).reset_index()

Or something like this:

df1 = df.groupby('B',as_index=False).agg(np.median)
pd.concat([df1, df.drop_duplicates(['B']).drop(list(df1), axis=1).reset_index(drop=True)], axis=1)
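
The one-liners above rely on agg quietly skipping the non-numeric columns. As far as I know, recent pandas releases (2.0 and later) raise an error instead, so here is a minimal sketch of the same select_dtypes idea that does not depend on that behavior. The names numeric_cols, other_cols, medians and firsts are only for illustration, and the random column D is replaced with np.arange so the result is reproducible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Timestamp("20180101"),
    "D": np.arange(10, dtype=float),  # deterministic stand-in for np.random.rand(10)
    "F": "foo",
})

# Names of the numeric columns (here 'A' and 'D').
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Every other column except the grouping key (here 'C' and 'F').
other_cols = [c for c in df.columns if c not in numeric_cols and c != "B"]

# Median of the numeric columns only, one row per value of 'B'.
medians = df.groupby("B")[numeric_cols].median()

# First value of each non-numeric column within each group.
firsts = df.groupby("B")[other_cols].first()

# Both frames are indexed by 'B', so a plain index join lines them up.
result = firsts.join(medians).reset_index()
print(result)
```

Taking the first value of C and F is an assumption on my part; it is only safe if those columns are constant within each group, as they are in the question's data.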



Answer 2:


If 'C' and 'F' are the same for each value of 'B', then you can include them in the groupby columns, like this:

df.groupby(['B','C','F']).agg(np.median).reset_index()

Or as @BradSolomn suggests:

df.groupby(['B','C','F'], as_index=False).agg(np.median)

Output:

           B          C    F    A         D
0 2013-01-02 2018-01-01  foo  1.0  0.392723

If not, then you'll need to aggregate 'C' and 'F' somehow, for example by taking the first or last value of each:

df.groupby('B').agg({'D':np.median,'A':np.median,'C':'first','F':'last'}).reset_index() 

           B          C    F    A         D
0 2013-01-02 2018-01-01  foo  1.0  0.392723
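
If there are many columns, writing that dictionary by hand gets tedious. A rough sketch of building it from the dtypes instead is below. The rule "median for numeric columns, 'first' for everything else" and the names key and agg_spec are my own assumptions, not something pandas prescribes, and D is made deterministic so the output is reproducible:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Timestamp("20180101"),
    "D": np.arange(10, dtype=float),  # deterministic stand-in for np.random.rand(10)
    "F": "foo",
})

key = "B"  # grouping column

# Assumed rule: numeric columns get the median, all others keep their first value.
agg_spec = {
    col: ("median" if is_numeric_dtype(df[col]) else "first")
    for col in df.columns
    if col != key
}

result = df.groupby(key, as_index=False).agg(agg_spec)
print(agg_spec)
print(result)
```

Swap 'first' for 'last', 'min', or a custom function per column if the non-numeric values can differ within a group.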


Source: https://stackoverflow.com/questions/46773467/how-to-aggregate-only-the-numerical-columns-in-a-mixed-dtypes-dataframe
