Apache Spark update a row in an RDD or Dataset based on another row

Submitted by 冷暖自知 on 2020-01-24 21:06:21

Question


I'm trying to figure out how I can update some rows based on another row.

For example, I have some data like

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
2, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

I want to give the users in the same city the same group Id (for montreal, either 1 or 2):

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
1, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

How can I achieve this in my RDD or Dataset?

For completeness, what if the Id is a String? Wouldn't dense rank stop working?

For example:

Id | username | ratings | city
--------------------------------
a, philip, 2.0, montreal, ...
b, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...

So the result looks like this:

grade | username | ratings | city
--------------------------------
a, philip, 2.0, montreal, ...
a, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...

Answer 1:


A clean way to do this is to use dense_rank() from the Window functions. It assigns consecutive integers to the distinct values of the window's ordering column, so rows that share a city get the same number. Because city is a String column, the ranks increase in alphabetical order.

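import org.apache.spark.sql.functions.dense_rank
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // enables the $"..." column syntax; already imported in spark-shell

val df = spark.createDataFrame(Seq(
  (1, "philip", 2.0, "montreal"),
  (2, "john", 4.0, "montreal"),
  (3, "charles", 2.0, "texas"))).toDF("Id", "username", "rating", "city")

// Order the window by city so that rows in the same city tie
val w = Window.orderBy($"city")
// dense_rank() leaves no gaps after ties; rank() would give texas 3 instead of 2
df.withColumn("id", dense_rank().over(w)).show()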

+---+--------+------+--------+
| id|username|rating|    city|
+---+--------+------+--------+
|  1|  philip|   2.0|montreal|
|  1|    john|   4.0|montreal|
|  2| charles|   2.0|   texas|
+---+--------+------+--------+
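
For the String Id case raised in the question, dense_rank() still runs, because it ranks the city column rather than Id, but it returns integers instead of reusing an existing Id. Below is a minimal sketch of one alternative, assuming the goal is to give every row in a city the smallest Id found in that city: take min("Id") over a window partitioned by city. The object name GroupIdSketch and the helper shareMinId are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.min

object GroupIdSketch {
  // Hypothetical helper: replaces Id with the smallest Id in the row's city.
  // Works for numeric and String Ids (Strings compare lexicographically).
  def shareMinId(df: DataFrame): DataFrame = {
    val byCity = Window.partitionBy("city")
    df.withColumn("Id", min("Id").over(byCity))
  }

  def main(args: Array[String]): Unit = {
    // Local session so the sketch runs on its own; an assumption, not part of the answer.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-id-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("a", "philip", 2.0, "montreal"),
      ("b", "john", 4.0, "montreal"),
      ("c", "charles", 2.0, "texas")).toDF("Id", "username", "rating", "city")

    // philip and john both end up with Id "a"; charles keeps "c"
    shareMinId(df).show()
    spark.stop()
  }
}
```

Partitioning by city also avoids pulling all rows into a single partition, which an unpartitioned Window.orderBy does.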



Answer 2:


Try:

import org.apache.spark.sql.functions.monotonically_increasing_id
df.select("city").distinct().withColumn("id", monotonically_increasing_id()).join(df.drop("id"), Seq("city"))
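
The one-liner above can be unpacked into steps. This is a sketch that assumes the same three sample rows as Answer 1; the object name CityIdJoinSketch and the helper idPerCity are hypothetical. Note that monotonically_increasing_id() guarantees unique, increasing values but not consecutive ones, so the generated ids may have gaps.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.monotonically_increasing_id

object CityIdJoinSketch {
  // Hypothetical helper: one generated id per distinct city, joined back onto the rows.
  def idPerCity(df: DataFrame): DataFrame = {
    // Step 1: one row per city, each tagged with a unique (not necessarily consecutive) id
    val cityIds = df.select("city").distinct()
      .withColumn("id", monotonically_increasing_id())
    // Step 2: drop the old Id and attach the per-city id by joining on city
    cityIds.join(df.drop("id"), Seq("city"))
  }

  def main(args: Array[String]): Unit = {
    // Local session so the sketch runs on its own; an assumption, not part of the answer.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("city-id-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "philip", 2.0, "montreal"),
      (2, "john", 4.0, "montreal"),
      (3, "charles", 2.0, "texas")).toDF("Id", "username", "rating", "city")

    // Columns come out as city, id, username, rating
    idPerCity(df).show()
    spark.stop()
  }
}
```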


Source: https://stackoverflow.com/questions/40047620/apache-spark-update-a-row-in-an-rdd-or-dataset-based-on-another-row
