pyspark: groupby and aggregate avg and first on multiple columns

渐次进展 2021-01-27 11:00

I have the following sample pyspark dataframe. After a groupby I want to calculate the mean and the first value of multiple columns. In the real case I have 100s of columns, so I can't do it indiv

1 answer
  • 2021-01-27 11:37

    The best way to apply multiple functions to multiple columns is to use the .agg(*expr) format.

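    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    # Reuse the active session, or start one if none exists
    spark = SparkSession.builder.getOrCreate()

    # Test data: two rows for each col4 value, matching the result shown below
    tst = spark.createDataFrame([(1,2,3,4),(3,4,5,4),(5,6,7,5),(7,8,9,5)], schema=['col1','col2','col3','col4'])
    # Aggregate functions to apply, and the columns to apply them to
    fn_l = [F.min, F.max, F.mean, F.first]
    col_l = ['col1', 'col2', 'col3']
    # One aliased expression per (function, column) pair, e.g. min_col1
    expr = [fn(coln).alias(fn.__name__ + '_' + coln) for fn in fn_l for coln in col_l]
    tst_r = tst.groupby('col4').agg(*expr)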
    

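    The result will be as follows. Row order, and the value F.first picks in each group, can differ between runs, since Spark does not guarantee row order after a shuffle.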

    tst_r.show()
    +----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
    |col4|min_col1|min_col2|min_col3|max_col1|max_col2|max_col3|mean_col1|mean_col2|mean_col3|first_col1|first_col2|first_col3|
    +----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
    |   5|       5|       6|       7|       7|       8|       9|      6.0|      7.0|      8.0|         5|         6|         7|
    |   4|       1|       2|       3|       3|       4|       5|      2.0|      3.0|      4.0|         1|         2|         3|
    +----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
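
    With hundreds of columns, as in the question, col_l does not have to be typed out by hand. A minimal sketch, assuming every column except the grouping key should be aggregated. value_columns is a hypothetical helper name, not part of PySpark; it only works on a plain list of names, such as the one tst.columns returns.

```python
def value_columns(all_columns, group_cols):
    """Return every column name that is not a grouping key, in original order."""
    group_set = set(group_cols)
    return [c for c in all_columns if c not in group_set]


# With the DataFrame from the answer this would be:
#   col_l = value_columns(tst.columns, ['col4'])
# Shown here on a plain list so it runs without a Spark session.
col_l = value_columns(['col1', 'col2', 'col3', 'col4'], ['col4'])
print(col_l)  # ['col1', 'col2', 'col3']
```

    The resulting list can be passed to the same list comprehension that builds expr.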
    

    To apply different functions to different columns, build several expression lists and concatenate them in the aggregation.

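    # min and max are applied to col1 and col2
    fn_l = [F.min, F.max]
    col_l = ['col1', 'col2']
    # mean and first are applied to col1, col3 and col4
    fn_2 = [F.mean, F.first]
    col_2 = ['col1', 'col3', 'col4']
    expr1 = [fn(coln).alias(fn.__name__ + '_' + coln) for fn in fn_l for coln in col_l]
    expr2 = [fn(coln).alias(fn.__name__ + '_' + coln) for fn in fn_2 for coln in col_2]
    # Concatenate both expression lists and pass them to a single agg call
    tst_r = tst.groupby('col4').agg(*(expr1 + expr2))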
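
    When the per-column choices get more varied, the same idea can be driven by a dict instead of several parallel lists. A minimal sketch: build_agg_exprs and the layout of spec are hypothetical, not a PySpark API. The helper only assumes that each function returns an object with an .alias() method, as F.min, F.max, F.mean and F.first do.

```python
def build_agg_exprs(spec):
    """Turn {column: [aggregate functions]} into a flat list of aliased expressions.

    Each alias is '<function name>_<column>', the same scheme as in the answer.
    """
    return [
        fn(coln).alias(fn.__name__ + '_' + coln)
        for coln, fns in spec.items()
        for fn in fns
    ]


# Usage with the DataFrame from the answer (needs pyspark and a Spark session):
#   import pyspark.sql.functions as F
#   spec = {'col1': [F.min, F.max, F.mean], 'col3': [F.mean, F.first]}
#   tst_r = tst.groupby('col4').agg(*build_agg_exprs(spec))
```

    Because the dict is iterated in insertion order, the output columns come out grouped by column rather than by function.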
    