Question
I have a mixed pd.DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20130102'),
                   'C': pd.Timestamp('20180101'),
                   'D': np.random.rand(10),
                   'F': 'foo'})
df
Out[12]:
A B C D F
0 1.0 2013-01-02 2018-01-01 0.592533 foo
1 1.0 2013-01-02 2018-01-01 0.819248 foo
2 1.0 2013-01-02 2018-01-01 0.298035 foo
3 1.0 2013-01-02 2018-01-01 0.330128 foo
4 1.0 2013-01-02 2018-01-01 0.371705 foo
5 1.0 2013-01-02 2018-01-01 0.541246 foo
6 1.0 2013-01-02 2018-01-01 0.976108 foo
7 1.0 2013-01-02 2018-01-01 0.423069 foo
8 1.0 2013-01-02 2018-01-01 0.863764 foo
9 1.0 2013-01-02 2018-01-01 0.037085 foo
I would like to aggregate my numerical columns, but also keep the non-numerical ones.
If I do a groupby followed by agg, I get:
df.groupby('B').agg(np.median)
Out[13]:
A D
B
2013-01-02 1.0 0.482157
This is fine, and I know it is the desired behavior, since the other dtypes would probably raise exceptions in np.median, but I would also like to keep my original column F with value foo, as well as C with 2018-01-01.
So far, I have solved this with a custom wrapper around my numerical aggregation functions, e.g. if I wanted to take a nanmedian over my dataframe:
def my_nan_median(x):
    if isinstance(x.values[0], np.datetime64):
        return np.min(x) # let the first datetime pass!
    elif isinstance(x.values[0], str):
        return x.values[0] # let the strings pass!
    else:
        return np.nanmedian(x)
but it looks awful. What is the right way to do this?
Answer 1:
By using select_dtypes:
df.groupby(list(df.select_dtypes(exclude=[np.number]))).agg(np.median).reset_index()
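To make the one-liner above easier to follow, here is a minimal sketch that splits it into steps. It assumes, as in the question, that every non-numeric column is constant within a group. The names key_cols and result are illustrative only, D is made deterministic so the result is reproducible, and the built-in .median() stands in for agg(np.median).

```python
import numpy as np
import pandas as pd

# Same layout as the question's frame, but with a deterministic D column
# (an assumption of this sketch, so the median is predictable).
df = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Timestamp("20180101"),
    "D": np.arange(10, dtype=float),
    "F": "foo",
})

# Names of all non-numeric columns: here B, C and F.
key_cols = list(df.select_dtypes(exclude=[np.number]).columns)

# Grouping by every non-numeric column leaves only numeric columns
# (A and D) to be aggregated.
result = df.groupby(key_cols, as_index=False).median()
print(result)
```

Note that this only collapses to one row per value of B when the other non-numeric columns do not vary within B; otherwise each distinct combination becomes its own group.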
Or something like this:
df1 = df.groupby('B',as_index=False).agg(np.median)
pd.concat([df1,df.drop_duplicates(['B']).drop(list(df1),axis=1).reset_index(drop=True)],axis=1)
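The concat approach is dense, so here is a hedged step-by-step sketch of the same idea. The names medians, extras and result are hypothetical, and the two-group data is made up for illustration. One deliberate change from the snippet above: groupby is called with sort=False, because concat along axis=1 pairs rows by position, and drop_duplicates keeps groups in order of first appearance while groupby sorts the keys by default.

```python
import pandas as pd

# Illustrative two-group frame; the later date appears first on purpose.
df = pd.DataFrame({
    "A": 1.0,
    "B": pd.to_datetime(["2013-01-03"] * 2 + ["2013-01-02"] * 2),
    "C": pd.Timestamp("20180101"),
    "D": [1.0, 3.0, 10.0, 20.0],
    "F": ["x", "x", "y", "y"],
})

# Step 1: median of the numeric columns per group, in first-appearance order.
medians = df.groupby("B", as_index=False, sort=False).median(numeric_only=True)

# Step 2: first row of each group, restricted to the columns that
# step 1 did not produce (the non-numeric, non-key columns).
extras = (
    df.drop_duplicates(["B"])
      .drop(columns=list(medians.columns))
      .reset_index(drop=True)
)

# Step 3: glue the two pieces together side by side.
result = pd.concat([medians, extras], axis=1)
print(result)
```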
Answer 2:
If 'C' and 'F' are the same for each value of 'B', then you can include them in the groupby columns, like this:
df.groupby(['B','C','F']).agg(np.median).reset_index()
Or as @BradSolomn suggests:
df.groupby(['B','C','F'], as_index=False).agg(np.median)
Output:
B C F A D
0 2013-01-02 2018-01-01 foo 1.0 0.392723
If not, then you'll need to aggregate 'C' and 'F' somehow, for example by taking the first or last value in each group:
df.groupby('B').agg({'D':np.median,'A':np.median,'C':'first','F':'last'}).reset_index()
B C F A D
0 2013-01-02 2018-01-01 foo 1.0 0.392723
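With many columns, the dictionary can be built from the dtypes instead of typed by hand. This is a minimal sketch, assuming that 'first' is an acceptable way to carry the non-numeric columns through; group_key, numeric_cols and agg_spec are hypothetical names. The string 'median' uses pandas' own group median, which skips NaN, so it plays the role of np.nanmedian from the question.

```python
import numpy as np
import pandas as pd

# Small frame in the question's layout, with a NaN in D to show it is skipped.
df = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Timestamp("20180101"),
    "D": [np.nan, 1.0, 2.0, 3.0],
    "F": "foo",
})

group_key = "B"
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Numeric columns get the median; everything else keeps its first value.
agg_spec = {
    col: ("median" if col in numeric_cols else "first")
    for col in df.columns
    if col != group_key
}

result = df.groupby(group_key).agg(agg_spec).reset_index()
print(result)
```

Swapping 'first' for 'last', 'min' or 'max' changes which non-numeric value survives, without touching the numeric part.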
Source: https://stackoverflow.com/questions/46773467/how-to-aggregate-only-the-numerical-columns-in-a-mixed-dtypes-dataframe