Question
I'm trying to figure out how I can update some rows based on another row.
For example, I have some data like:
Id | username | ratings | city     | ...
----------------------------------------
1  | philip   | 2.0     | montreal | ...
2  | john     | 4.0     | montreal | ...
3  | charles  | 2.0     | texas    | ...
I want to update the users in the same city so that they share the same groupId (either 1 or 2):
Id | username | ratings | city     | ...
----------------------------------------
1  | philip   | 2.0     | montreal | ...
1  | john     | 4.0     | montreal | ...
3  | charles  | 2.0     | texas    | ...
How can I achieve this in my RDD or Dataset?
Just for the sake of completeness: what if the Id is a String? Won't dense rank stop working then?
For example:
Id | username | ratings | city     | ...
----------------------------------------
a  | philip   | 2.0     | montreal | ...
b  | john     | 4.0     | montreal | ...
c  | charles  | 2.0     | texas    | ...
So the result looks like this:
grade | username | ratings | city     | ...
-------------------------------------------
a     | philip   | 2.0     | montreal | ...
a     | john     | 4.0     | montreal | ...
c     | charles  | 2.0     | texas    | ...
Answer 1:
A clean way to do this would be to use dense_rank() from the Window functions. It enumerates the unique values in your Window column. Because city is a String column, the values will increase in alphabetical order of the city.
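import org.apache.spark.sql.functions.dense_rank
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // enables the $"..." column syntax (already in scope in spark-shell)
val df = spark.createDataFrame(Seq(
  (1, "philip", 2.0, "montreal"),
  (2, "john", 4.0, "montreal"),
  (3, "charles", 2.0, "texas"))).toDF("Id", "username", "rating", "city")
// No partitionBy: all rows are ranked in one global window, ordered by city.
val w = Window.orderBy($"city")
// dense_rank() gives rows with the same city the same number and leaves no gaps (1, 1, 2).
df.withColumn("id", dense_rank().over(w)).show()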
+---+--------+------+--------+
| id|username|rating|    city|
+---+--------+------+--------+
|  1|  philip|   2.0|montreal|
|  1|    john|   4.0|montreal|
|  2| charles|   2.0|   texas|
+---+--------+------+--------+
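On the follow-up about String Ids: ranking by city produces integer ids whatever type the original Id column has, so it does not reproduce the a / a / c result shown in the question. If the goal is instead to reuse one of the existing Ids per city, one possible sketch is to take the smallest Id within each city using a window partitioned by city. This is an assumption about what the asker wants, not something the answers confirm. The object name SharedStringId and the local SparkSession setup are only there for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.min

// Hypothetical wrapper object so the sketch compiles on its own.
object SharedStringId {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("shared-string-id")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("a", "philip", 2.0, "montreal"),
      ("b", "john", 4.0, "montreal"),
      ("c", "charles", 2.0, "texas")
    ).toDF("Id", "username", "rating", "city")

    // One window per city; no ordering is needed for an aggregate like min.
    val byCity = Window.partitionBy($"city")

    // Overwrite Id with the alphabetically smallest Id found in the same city.
    df.withColumn("Id", min($"Id").over(byCity)).show()

    spark.stop()
  }
}
```

With the sample rows, both montreal users end up with Id a and the texas user keeps c. The same expression applied to the integer Ids from the first example would give 1, 1, 3. Because the window is partitioned by city, the rows do not all have to be moved into a single partition.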
Answer 2:
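Try the following. Note that monotonically_increasing_id() produces ids that are unique and increasing, but not necessarily consecutive:
import org.apache.spark.sql.functions.monotonically_increasing_id
df.select("city").distinct.withColumn("id", monotonically_increasing_id()).join(df.drop("id"), Seq("city"))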
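If consecutive ids (1, 2, ...) matter, a possible variant of this join approach is to number only the distinct cities with row_number() and then join the result back. This is a sketch under the assumption that the number of distinct cities is small enough to be ordered in one global window. The object name ConsecutiveCityIds and the local session setup are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Hypothetical wrapper object so the sketch compiles on its own.
object ConsecutiveCityIds {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("consecutive-city-ids")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "philip", 2.0, "montreal"),
      (2, "john", 4.0, "montreal"),
      (3, "charles", 2.0, "texas")
    ).toDF("Id", "username", "rating", "city")

    // Number the distinct cities only. The window is global, but it sees
    // one row per city rather than one row per user.
    val cityIds = df.select("city").distinct()
      .withColumn("id", row_number().over(Window.orderBy("city")))

    // Drop the original Id and attach the per-city id to every user row.
    val result = df.drop("Id").join(cityIds, Seq("city"))
    result.show()

    spark.stop()
  }
}
```

Here montreal gets id 1 and texas gets id 2, since the cities are numbered in alphabetical order. The joined result lists city first, followed by the remaining columns of df and then id.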
Source: https://stackoverflow.com/questions/40047620/apache-spark-update-a-row-in-an-rdd-or-dataset-based-on-another-row