How to select the first row of each group?

Asked by 心在旅途 on 2020-11-21 05:49

I have a DataFrame generated as follow:

df.groupBy($"Hour", $"Category")
  .agg(sum($"value") as "TotalValue")
  .sort($"Hour".asc, $"TotalValue".desc)


8 answers
  • 2020-11-21 06:53

The pattern is: group by keys => do something to each group (e.g. reduce) => return to a DataFrame.

I found the DataFrame abstraction a bit cumbersome for this case, so I dropped down to RDD functionality:

val rdd: RDD[Row] = originalDf
  .rdd
  // group rows by the value of the grouping column
  .groupBy(row => row.getAs[String]("grouping_row"))
  // collapse each group to a single Row with a user-supplied reduce function
  .map { case (_, rows) => rows.reduce(reduceFunction) }

val productDf = sqlContext.createDataFrame(rdd, originalDf.schema)
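
The answer leaves `reduceFunction` undefined. A minimal sketch, assuming the goal from the question (keep the row with the largest `TotalValue` in each group) and that `TotalValue` is a `Double` column:

```scala
import org.apache.spark.sql.Row

// Hypothetical reduce function: of two rows in the same group,
// keep the one with the larger TotalValue.
val reduceFunction: (Row, Row) => Row = (a, b) =>
  if (a.getAs[Double]("TotalValue") >= b.getAs[Double]("TotalValue")) a else b
```

Any associative function of two `Row`s that returns one of them would fit the same pattern.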
    
  • 2020-11-21 06:55

    This is exactly the same as zero323's answer, but using SQL queries.

    Assuming that the dataframe is created and registered as:

    df.createOrReplaceTempView("table")
    //+----+--------+----------+
    //|Hour|Category|TotalValue|
    //+----+--------+----------+
    //|0   |cat26   |30.9      |
    //|0   |cat13   |22.1      |
    //|0   |cat95   |19.6      |
    //|0   |cat105  |1.3       |
    //|1   |cat67   |28.5      |
    //|1   |cat4    |26.8      |
    //|1   |cat13   |12.6      |
    //|1   |cat23   |5.3       |
    //|2   |cat56   |39.6      |
    //|2   |cat40   |29.7      |
    //|2   |cat187  |27.9      |
    //|2   |cat68   |9.8       |
    //|3   |cat8    |35.6      |
    //+----+--------+----------+
    

    Window function:

    sqlContext.sql("select Hour, Category, TotalValue from (select *, row_number() OVER (PARTITION BY Hour ORDER BY TotalValue DESC) as rn  FROM table) tmp where rn = 1").show(false)
    //+----+--------+----------+
    //|Hour|Category|TotalValue|
    //+----+--------+----------+
    //|1   |cat67   |28.5      |
    //|3   |cat8    |35.6      |
    //|2   |cat56   |39.6      |
    //|0   |cat26   |30.9      |
    //+----+--------+----------+
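
The same window query can also be written with the DataFrame API (a sketch, equivalent to the SQL above):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// rank rows within each Hour partition by descending TotalValue
val w = Window.partitionBy($"Hour").orderBy($"TotalValue".desc)

df.withColumn("rn", row_number().over(w))
  .where($"rn" === 1)   // keep only the top row per Hour
  .drop("rn")
  .show(false)
```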
    

    Plain SQL aggregation followed by join:

    sqlContext.sql("select Hour, first(Category) as Category, first(TotalValue) as TotalValue from " +
      "(select Hour, Category, TotalValue from table tmp1 " +
      "join " +
      "(select Hour as max_hour, max(TotalValue) as max_value from table group by Hour) tmp2 " +
      "on " +
      "tmp1.Hour = tmp2.max_hour and tmp1.TotalValue = tmp2.max_value) tmp3 " +
      "group by tmp3.Hour")
      .show(false)
    //+----+--------+----------+
    //|Hour|Category|TotalValue|
    //+----+--------+----------+
    //|1   |cat67   |28.5      |
    //|3   |cat8    |35.6      |
    //|2   |cat56   |39.6      |
    //|0   |cat26   |30.9      |
    //+----+--------+----------+
    

    Using ordering over structs:

    sqlContext.sql("select Hour, vs.Category, vs.TotalValue from (select Hour, max(struct(TotalValue, Category)) as vs from table group by Hour)").show(false)
    //+----+--------+----------+
    //|Hour|Category|TotalValue|
    //+----+--------+----------+
    //|1   |cat67   |28.5      |
    //|3   |cat8    |35.6      |
    //|2   |cat56   |39.6      |
    //|0   |cat26   |30.9      |
    //+----+--------+----------+
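
The struct trick has a DataFrame API counterpart as well (a sketch): `max` over a struct compares fields left to right, so putting `TotalValue` first makes the max struct carry along the winning `Category`.

```scala
import org.apache.spark.sql.functions.{max, struct}

df.groupBy($"Hour")
  .agg(max(struct($"TotalValue", $"Category")).as("vs"))
  .select($"Hour", $"vs.Category", $"vs.TotalValue")
  .show(false)
```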
    

    The Dataset approach and the things to avoid are the same as in zero323's original answer.
