What is the pandas equivalent of dplyr summarize/aggregate by multiple functions?

倾然丶 夕夏残阳落幕 提交于 2019-11-30 10:15:42

问题


I'm having issues transitioning to pandas from R where dplyr package can easily group-by and perform multiple summarizations.

Please help improve my existing Python pandas code for multiple aggregations:

import pandas as pd
data = pd.DataFrame(
    {'col1':[1,1,1,1,1,2,2,2,2,2],
    'col2':[1,2,3,4,5,6,7,8,9,0],
     'col3':[-1,-2,-3,-4,-5,-6,-7,-8,-9,0]
    }
)
result = []
for k,v in data.groupby('col1'):
    result.append([k, max(v['col2']), min(v['col3'])])
print pd.DataFrame(result, columns=['col1', 'col2_agg', 'col3_agg'])

Issues:

  • too verbose
  • probably can be optimized and efficient. (I rewrote a for-loop groupby implementation into groupby.agg and the performance enhancement was huge).

In R the equivalent code would be:

data %>% groupby(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))

UPDATE: @ayhan solved my question, here is a follow-up question that I will post here instead of as comment:

Q2) What is the equivalent of groupby().summarize(newcolumn=max(col2 * col3)) i.e. an aggregation/summarization where the function is a compound function of 2+ columns?


回答1:


The equivalent of

df %>% groupby(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))

is

df.groupby('col1').agg({'col2': 'max', 'col3': 'min'})

which returns

      col2  col3
col1            
1        5    -5
2        9    -9

The returning object is a pandas.DataFrame with an index called col1 and columns named col2 and col3. By default, when you group your data pandas sets the grouping column(s) as index for efficient access and modification. However, if you don't want that, there are two alternatives to set col1 as a column.

  • Pass as_index=False:

    df.groupby('col1', as_index=False).agg({'col2': 'max', 'col3': 'min'})
    
  • Call reset_index:

    df.groupby('col1').agg({'col2': 'max', 'col3': 'min'}).reset_index()
    

both yield

col1  col2  col3           
   1     5    -5
   2     9    -9

You can also pass multiple functions to groupby.agg.

agg_df = df.groupby('col1').agg({'col2': ['max', 'min', 'std'], 
                                 'col3': ['size', 'std', 'mean', 'max']})

This also returns a DataFrame but now it has a MultiIndex for columns.

     col2               col3                   
      max min       std size       std mean max
col1                                           
1       5   1  1.581139    5  1.581139   -3  -1
2       9   0  3.535534    5  3.535534   -6   0

MultiIndex is very handy for selection and grouping. Here are some examples:

agg_df['col2']  # select the second column
      max  min       std
col1                    
1       5    1  1.581139
2       9    0  3.535534

agg_df[('col2', 'max')]  # select the maximum of the second column
Out: 
col1
1    5
2    9
Name: (col2, max), dtype: int64

agg_df.xs('max', axis=1, level=1)  # select the maximum of all columns
Out: 
      col2  col3
col1            
1        5    -1
2        9     0

Earlier (before version 0.20.0) it was possible to use dictionaries for renaming the columns in the agg call. For example

df.groupby('col1')['col2'].agg({'max_col2': 'max'})

would return the maximum of the second column as max_col2:

      max_col2
col1          
1            5
2            9

However, it was deprecated in favor of the rename method:

df.groupby('col1')['col2'].agg(['max']).rename(columns={'max': 'col2_max'})

      col2_max
col1          
1            5
2            9

It can get verbose for a DataFrame like agg_df defined above. You can use a renaming function to flatten those levels in that case:

agg_df.columns = ['_'.join(col) for col in agg_df.columns]

      col2_max  col2_min  col2_std  col3_size  col3_std  col3_mean  col3_max
col1                                                                        
1            5         1  1.581139          5  1.581139         -3        -1
2            9         0  3.535534          5  3.535534         -6         0

For operations like groupby().summarize(newcolumn=max(col2 * col3)), you can still use agg by first adding a new column with assign.

df.assign(new_col=df.eval('col2 * col3')).groupby('col1').agg('max') 

      col2  col3  new_col
col1                     
1        5    -1       -1
2        9     0        0

This returns maximum for old and new columns but as always you can slice that.

df.assign(new_col=df.eval('col2 * col3')).groupby('col1')['new_col'].agg('max')

col1
1   -1
2    0
Name: new_col, dtype: int64

With groupby.apply this would be shorter:

df.groupby('col1').apply(lambda x: (x.col2 * x.col3).max())

col1
1   -1
2    0
dtype: int64

However, groupby.apply treats this as a custom function so it is not vectorized. Up to now, the functions we passed to agg ('min', 'max', 'min', 'size' etc.) are vectorized and these are aliases for those optimized functions. You can replace df.groupby('col1').agg('min') with df.groupby('col1').agg(min), df.groupby('col1').agg(np.min) or df.groupby('col1').min() and they will all execute the same function. You will not see the same efficiency when you use custom functions.

Lastly, as of version 0.20, agg can be used on DataFrames directly, without having to group first. See examples here.




回答2:


Check the side-by-side comparison given by Pandas documentation here: http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html#grouping-and-summarizing

R's dplyr

gdf <- group_by(df, col1)
summarise(gdf, avg=mean(col1, na.rm=TRUE))  

Pandas

gdf = df.groupby('col1')
df.groupby('col1').agg({'col1': 'mean'})


来源:https://stackoverflow.com/questions/38935541/what-is-the-pandas-equivalent-of-dplyr-summarize-aggregate-by-multiple-functions

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!