Find maximum row per group in Spark DataFrame

-上瘾入骨i 2020-11-22 03:47

I'm trying to use Spark DataFrames instead of RDDs, since they are higher-level and tend to produce more readable code.

In a 14-nodes Google Da

2 answers

伪装坚强ぢ (original poster)

2020-11-22 04:28
I think what you might be looking for are window functions: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Here is an example in Scala (I don't have a Spark Shell with Hive available right now, so I was not able to test the code, but I think it should work):
```
case class MyRow(name: String, id_sa: String, id_sb: String)

val myDF = sc.parallelize(Array(
    MyRow("n1", "a1", "b1"),
    MyRow("n2", "a1", "b2"),
    MyRow("n3", "a1", "b2"),
    MyRow("n1", "a2", "b2")
)).toDF("name", "id_sa", "id_sb")

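import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

// Within each id_sa group, order rows by id_sb descending so the largest value comes first.
val windowSpec = Window.partitionBy(myDF("id_sa")).orderBy(myDF("id_sb").desc)

// Attach the group's largest id_sb to every row, then keep only the rows that match it.
myDF.withColumn("max_id_sb", first(myDF("id_sb")).over(windowSpec)).filter("id_sb = max_id_sb")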
```
There are probably more efficient ways to achieve the same results with Window functions, but I hope this points you in the right direction.
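One alternative that avoids the window sort is to compute the maximum per group with `groupBy` and join it back onto the original rows. The following is only a sketch: it assumes Spark 2.0 or later (it uses `SparkSession` rather than `sc`), a local master, and the same sample data as above. It has not been benchmarked against the window version, so whether it is actually faster will depend on your data and cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max}

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("max-per-group").getOrCreate()
import spark.implicits._

// Same sample rows as in the window example.
val myDF = Seq(
  ("n1", "a1", "b1"),
  ("n2", "a1", "b2"),
  ("n3", "a1", "b2"),
  ("n1", "a2", "b2")
).toDF("name", "id_sa", "id_sb")

// One row per id_sa holding the largest id_sb of that group.
val maxPerGroup = myDF.groupBy("id_sa").agg(max("id_sb").as("max_id_sb"))

// Join the maxima back and keep only the rows that reach their group's maximum.
val result = myDF
  .join(maxPerGroup, Seq("id_sa"))
  .filter(col("id_sb") === col("max_id_sb"))
  .drop("max_id_sb")

result.show()
```

Like the window version, this keeps every row that ties for the maximum, so group `a1` yields both `n2` and `n3`. If exactly one row per group is needed, a tie-breaking rule has to be added.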