How to select the first row of each group?

心在旅途 2020-11-21 05:49

I have a DataFrame generated as follows:

df.groupBy($"Hour", $"Category")
  .agg(sum($"value") as "TotalValue")
  .sort($"Hour".asc, $"TotalValue"


        
8 Answers
  •  暗喜 (OP)
    2020-11-21 06:41

    We can use the rank() window function and keep only the rows where rank = 1. rank() assigns a number to every row within a group (in this case the group would be the hour), following the ordering defined on the window.

    Here's an example (from https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-sql-functions.adoc#rank):

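    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.rank

    // Nine rows (id 0..8) split into three buckets by id % 3
    val dataset = spark.range(9).withColumn("bucket", 'id % 3)

    // Rank the rows within each bucket, ordered by id
    val byBucket = Window.partitionBy('bucket).orderBy('id)

    scala> dataset.withColumn("rank", rank over byBucket).show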
    +---+------+----+
    | id|bucket|rank|
    +---+------+----+
    |  0|     0|   1|
    |  3|     0|   2|
    |  6|     0|   3|
    |  1|     1|   1|
    |  4|     1|   2|
    |  7|     1|   3|
    |  2|     2|   1|
    |  5|     2|   2|
    |  8|     2|   3|
    +---+------+----+
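
    The example above only adds the rank column; it does not yet filter on it. A minimal sketch of the full "first row per group" step, applied to the question's columns, might look like the following. The sample rows, the object and method names, and the choice to order by TotalValue descending are assumptions for illustration, not something the question confirms. Note that rank() gives tied rows the same rank, so a tie for first place keeps several rows per hour; swapping in row_number() would keep exactly one.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, rank}

    object FirstRowPerGroup {
      // Keep the top-ranked row(s) of each Hour.
      // Assumption: "first" means highest TotalValue within the hour.
      def topPerHour(df: DataFrame): DataFrame = {
        val byHour = Window.partitionBy(col("Hour")).orderBy(col("TotalValue").desc)
        df.withColumn("rank", rank().over(byHour))
          .where(col("rank") === 1) // rank 1 = first row of the group
          .drop("rank")             // helper column is no longer needed
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("first-row-per-group")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical sample data standing in for the aggregated DataFrame
        val df = Seq(
          (0, "cat26", 30.9),
          (0, "cat13", 22.1),
          (1, "cat67", 28.5),
          (1, "cat4", 26.8)
        ).toDF("Hour", "Category", "TotalValue")

        // Keeps (0, cat26, 30.9) and (1, cat67, 28.5)
        topPerHour(df).show()

        spark.stop()
      }
    }
    ```

    Filtering on the window column after withColumn is needed because window functions cannot be used directly inside where.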
    
