Scala-Spark Dynamically call groupby and agg with parameter values

有刺的猬 2020-12-03 19:34

I want to write a custom grouping and aggregation function that takes user-specified column names and a user-specified aggregation map. I do not know the column names or the aggregation map in advance.

1 Answer
  • 2020-12-03 19:50

    Your code is almost correct, with two issues:

    1. The return type of your function is DataFrame, but its last line is aggregated.show(), which returns Unit. Remove the call to show so the function returns aggregated itself, or simply return the result of agg directly.

    2. DataFrame.groupBy expects its arguments as col1: String, cols: String*, so you need to pass them in matching form: the first column, then the remaining columns expanded as varargs. You can do that as follows: df.groupBy(cols.head, cols.tail: _*)

    Altogether, your function would be:

    def groupAndAggregate(df: DataFrame, aggregateFun: Map[String, String], cols: List[String]): DataFrame = {
      val grouped = df.groupBy(cols.head, cols.tail: _*)
      val aggregated = grouped.agg(aggregateFun)
      aggregated
    }
    

    Or, a similar shorter version:

    def groupAndAggregate(df: DataFrame, aggregateFun: Map[String, String], cols: List[String]): DataFrame = {
      df.groupBy(cols.head, cols.tail: _*).agg(aggregateFun)
    }
    

    If you do want to call show within your function:

    def groupAndAggregate(df: DataFrame, aggregateFun: Map[String, String], cols: List[String]): DataFrame = {
      val grouped = df.groupBy(cols.head, cols.tail: _*)
      val aggregated = grouped.agg(aggregateFun)
      aggregated.show()
      aggregated
    }
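
    As a usage sketch: the sample data, the column names dept/salary/bonus, and the local SparkSession setup below are assumptions for illustration, not part of the original question.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Local session for demonstration only (assumed setup).
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq(
      ("sales", 100, 10),
      ("sales", 200, 20),
      ("hr",    150, 15)
    ).toDF("dept", "salary", "bonus")

    // Group by "dept"; aggregate using the Map form of agg.
    val result = groupAndAggregate(df, Map("salary" -> "sum", "bonus" -> "max"), List("dept"))
    // result has columns named like: dept, sum(salary), max(bonus)
    result.show()
    ```

    Note that the Map-based agg produces column names such as sum(salary); if you need cleaner names, follow up with withColumnRenamed or use the Column-based agg overload instead.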
    