Question
I'm trying to implement SCD Type 2 logic in Spark 2.4.4. I have two DataFrames: one containing the 'Existing Data' and the other containing the 'New Incoming Data'.
Input and expected output are given below. What needs to happen is:
All incoming rows should get appended to the existing data.
Only the following 3 rows, which were previously 'active', should become inactive, with the appropriate 'endDate' populated as follows:
pk=1, amount = 20 => the row should become 'inactive' and its 'endDate' should be the 'startDate' of the following row (lead)
pk=2, amount = 100 => the row should become 'inactive' and its 'endDate' should be the 'startDate' of the following row (lead)
pk=3, amount = 750 => the row should become 'inactive' and its 'endDate' should be the 'startDate' of the following row (lead)
How do I do this in Spark?
Existing Data:
+---+------+-------------------+-------------------+------+
| pk|amount| startDate| endDate|active|
+---+------+-------------------+-------------------+------+
| 1| 10|2019-01-01 12:00:00|2019-01-20 05:00:00| 0|
| 1| 20|2019-01-20 05:00:00| null| 1|
| 2| 100|2019-01-01 00:00:00| null| 1|
| 3| 75|2019-01-01 06:00:00|2019-01-26 08:00:00| 0|
| 3| 750|2019-01-26 08:00:00| null| 1|
| 10| 40|2019-01-01 00:00:00| null| 1|
+---+------+-------------------+-------------------+------+
New Incoming Data:
+---+------+-------------------+-------------------+------+
| pk|amount| startDate| endDate|active|
+---+------+-------------------+-------------------+------+
| 1| 50|2019-02-01 07:00:00|2019-02-02 08:00:00| 0|
| 1| 75|2019-02-02 08:00:00| null| 1|
| 2| 200|2019-02-01 05:00:00|2019-02-01 13:00:00| 0|
| 2| 60|2019-02-01 13:00:00|2019-02-01 19:00:00| 0|
| 2| 500|2019-02-01 19:00:00| null| 1|
| 3| 175|2019-02-01 00:00:00| null| 1|
| 4| 50|2019-02-02 12:00:00|2019-02-02 14:00:00| 0|
| 4| 300|2019-02-02 14:00:00| null| 1|
| 5| 500|2019-02-02 00:00:00| null| 1|
+---+------+-------------------+-------------------+------+
Expected Output:
+---+------+-------------------+-------------------+------+
| pk|amount| startDate| endDate|active|
+---+------+-------------------+-------------------+------+
| 1| 10|2019-01-01 12:00:00|2019-01-20 05:00:00| 0|
| 1| 20|2019-01-20 05:00:00|2019-02-01 07:00:00| 0|
| 1| 50|2019-02-01 07:00:00|2019-02-02 08:00:00| 0|
| 1| 75|2019-02-02 08:00:00| null| 1|
| 2| 100|2019-01-01 00:00:00|2019-02-01 05:00:00| 0|
| 2| 200|2019-02-01 05:00:00|2019-02-01 13:00:00| 0|
| 2| 60|2019-02-01 13:00:00|2019-02-01 19:00:00| 0|
| 2| 500|2019-02-01 19:00:00| null| 1|
| 3| 75|2019-01-01 06:00:00|2019-01-26 08:00:00| 0|
| 3| 750|2019-01-26 08:00:00|2019-02-01 00:00:00| 1|
| 3| 175|2019-02-01 00:00:00| null| 1|
| 4| 50|2019-02-02 12:00:00|2019-02-02 14:00:00| 0|
| 4| 300|2019-02-02 14:00:00| null| 1|
| 5| 500|2019-02-02 00:00:00| null| 1|
| 10| 40|2019-01-01 00:00:00| null| 1|
+---+------+-------------------+-------------------+------+
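To experiment with the answers below, the two input DataFrames can be recreated from the tables above. A minimal sketch, assuming an existing SparkSession named spark (the frames are named dfOld and dfNew here):

import java.sql.Timestamp
import spark.implicits._

// shorthand for the "yyyy-MM-dd HH:mm:ss" timestamps shown in the tables
def ts(s: String): Timestamp = Timestamp.valueOf(s)

val dfOld = Seq(
  (1, 10, ts("2019-01-01 12:00:00"), Some(ts("2019-01-20 05:00:00")), 0),
  (1, 20, ts("2019-01-20 05:00:00"), None, 1),
  (2, 100, ts("2019-01-01 00:00:00"), None, 1),
  (3, 75, ts("2019-01-01 06:00:00"), Some(ts("2019-01-26 08:00:00")), 0),
  (3, 750, ts("2019-01-26 08:00:00"), None, 1),
  (10, 40, ts("2019-01-01 00:00:00"), None, 1)
).toDF("pk", "amount", "startDate", "endDate", "active")

val dfNew = Seq(
  (1, 50, ts("2019-02-01 07:00:00"), Some(ts("2019-02-02 08:00:00")), 0),
  (1, 75, ts("2019-02-02 08:00:00"), None, 1),
  (2, 200, ts("2019-02-01 05:00:00"), Some(ts("2019-02-01 13:00:00")), 0),
  (2, 60, ts("2019-02-01 13:00:00"), Some(ts("2019-02-01 19:00:00")), 0),
  (2, 500, ts("2019-02-01 19:00:00"), None, 1),
  (3, 175, ts("2019-02-01 00:00:00"), None, 1),
  (4, 50, ts("2019-02-02 12:00:00"), Some(ts("2019-02-02 14:00:00")), 0),
  (4, 300, ts("2019-02-02 14:00:00"), None, 1),
  (5, 500, ts("2019-02-02 00:00:00"), None, 1)
).toDF("pk", "amount", "startDate", "endDate", "active")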
Answer 1:
You can start by selecting the earliest startDate for each pk group from the new DataFrame and joining it with the old one to update the desired columns. Then union the join result with the new DataFrame.
Something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"col" syntax

// get the earliest startDate for each pk group in the new data
val w = Window.partitionBy($"pk").orderBy($"startDate")
val updates = df_new.withColumn("rn", row_number().over(w)).filter("rn = 1").select($"pk", $"startDate")

// join with the old data and close the matching active rows; the condition is
// computed once up front so that updating endDate does not affect the active update
val joinOldNew = df_old.join(updates.alias("new"), Seq("pk"), "left")
  .withColumn("toClose", $"endDate".isNull && $"active" === lit(1) && $"new.startDate".isNotNull)
  .withColumn("endDate", when($"toClose", $"new.startDate").otherwise($"endDate"))
  .withColumn("active", when($"toClose", lit(0)).otherwise($"active"))
  .drop($"new.startDate")
  .drop("toClose")

// union the updated old data with the new data
val result = joinOldNew.union(df_new)
Answer 2:
- Union the 2 data frames.
- groupByKey on pk.
- mapGroups will provide a tuple of the key and an iterator of rows.
- On each group, sort the rows, iterate over all of them, close the records you need to close and keep the rows you want.
// union the existing and the incoming data, then process each pk group
val df = dfOld.union(dfNew)
df.groupByKey(row => row.getAs[Int]("pk"))
  .mapGroups { case (key, rows) =>
    // apply all the logic you need per pk:
    // sort the rows by startDate, keep the latest one open, close the old ones
  }
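A more complete, self-contained sketch of this idea, under a couple of assumptions: the rows are modeled with an illustrative Record case class, and flatMapGroups is used instead of mapGroups because each group has to emit several output rows.

import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}

// illustrative record type matching the columns from the question
case class Record(pk: Int, amount: Int, startDate: Timestamp,
                  endDate: Option[Timestamp], active: Int)

def mergeScd2(dfOld: Dataset[Record], dfNew: Dataset[Record])(spark: SparkSession): Dataset[Record] = {
  import spark.implicits._
  dfOld.union(dfNew)
    .groupByKey(_.pk)
    .flatMapGroups { (_, rows) =>
      // sort the full history of the key by startDate
      val sorted = rows.toSeq.sortBy(_.startDate.getTime)
      sorted.zipWithIndex.map {
        case (r, i) if i == sorted.length - 1 => r      // latest row stays open/active
        case (r, i) if r.endDate.isEmpty =>             // close previously open rows
          r.copy(endDate = Some(sorted(i + 1).startDate), active = 0)
        case (r, _) => r                                // already-closed rows are kept as-is
      }
    }
}

With the frames from the question converted to Dataset[Record] via .as[Record], calling mergeScd2(dfOld.as[Record], dfNew.as[Record])(spark) should return the merged history per key.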
Answer 3:
Thanks to the answer suggested by @blackbishop, I was able to get it working. Here's the working version (in case someone's interested):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// get the earliest startDate for each pk group in the new data
val w = Window.partitionBy("pk").orderBy("startDate")
val updates = dfNew.withColumn("rn", row_number().over(w)).filter("rn = 1").select("pk", "startDate")
// join with the old data and update the old values when there is a match
val joinOldNew = dfOld.join(updates.alias("new"), Seq("pk"), "left")
.withColumn("endDate", when(col("endDate").isNull
&& col("active") === lit(1) && col("new.startDate").isNotNull,
col("new.startDate")).otherwise(col("endDate")))
.withColumn("active", when(col("endDate").isNull, lit(1))
.otherwise(lit(0)))
.drop(col("new.startDate"))
// union with the new data (the orderBy is not necessary; it is only there to facilitate testing)
val results = joinOldNew.union(dfNew).orderBy(col("pk"), col("startDate"))
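With the sample frames built from the tables in the question, a quick way to compare the result against the expected output:

// print the full merged history without truncating the timestamp columns
results.show(truncate = false)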
Source: https://stackoverflow.com/questions/59586700/implement-scd-type-2-in-spark