How to select the first row of each group?

心在旅途 2020-11-21 05:49

I have a DataFrame generated as follows:

df.groupBy($"Hour", $"Category")
  .agg(sum($"value") as "TotalValue")
  .sort($"Hour".asc, $"TotalValue".desc)


        
8 Answers
    予麋鹿 (OP) 2020-11-21 06:39

    The solution below does only one groupBy and extracts, in a single pass, the rows of your dataframe that contain the maximum TotalValue. No need for further joins or windows.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.DataFrame
    import spark.implicits._ // provides the implicit Encoder[Int] that groupByKey needs

    // df is the dataframe with Hour, Category, TotalValue

    // Encoder so mapGroups can emit untyped Rows with df's original schema
    implicit val dfEnc = RowEncoder(df.schema)

    val res: DataFrame = df
      .groupByKey(r => r.getInt(0)) // key on the first column (Hour)
      .mapGroups[Row]((hour: Int, rows: Iterator[Row]) =>
        rows.maxBy(r => r.getDouble(2))) // keep the row with the largest TotalValue
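
    For comparison, the window-function route this answer avoids would look roughly like the sketch below (not from the original answer; it assumes the same df and import spark.implicits._):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    // One window partition per Hour, with the largest TotalValue ranked first
    val w = Window.partitionBy($"Hour").orderBy($"TotalValue".desc)

    val top = df.withColumn("rn", row_number().over(w))
      .where($"rn" === 1)
      .drop("rn")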
    
