Question
I'm trying to implement SCD Type 2 logic in Spark 2.4.4. I have two DataFrames: one containing the 'Existing Data' and the other containing the 'New Incoming Data'.
Input and expected output are given below. What needs to happen is:
All incoming rows should get appended to the existing data.
Only the following 3 rows, which were previously 'active', should become inactive, with the appropriate 'endDate' populated as follows:
pk=1, amount = 20 => the row should become 'inactive' and its 'endDate' should be the 'startDate' of the following row (lead)
pk=2, amount = 100 => the row should become 'inactive' and its 'endDate' should be the 'startDate' of the following row (lead)
pk=3, amount = 750 => the row should become 'inactive' and its 'endDate' should be the 'startDate' of the following row (lead)
How do I do this in Spark?
Existing Data:
+---+------+-------------------+-------------------+------+
| pk|amount| startDate| endDate|active|
+---+------+-------------------+-------------------+------+
| 1| 10|2019-01-01 12:00:00|2019-01-20 05:00:00| 0|
| 1| 20|2019-01-20 05:00:00| null| 1|
| 2| 100|2019-01-01 00:00:00| null| 1|
| 3| 75|2019-01-01 06:00:00|2019-01-26 08:00:00| 0|
| 3| 750|2019-01-26 08:00:00| null| 1|
| 10| 40|2019-01-01 00:00:00| null| 1|
+---+------+-------------------+-------------------+------+
New Incoming Data:
+---+------+-------------------+-------------------+------+
| pk|amount| startDate| endDate|active|
+---+------+-------------------+-------------------+------+
| 1| 50|2019-02-01 07:00:00|2019-02-02 08:00:00| 0|
| 1| 75|2019-02-02 08:00:00| null| 1|
| 2| 200|2019-02-01 05:00:00|2019-02-01 13:00:00| 0|
| 2| 60|2019-02-01 13:00:00|2019-02-01 19:00:00| 0|
| 2| 500|2019-02-01 19:00:00| null| 1|
| 3| 175|2019-02-01 00:00:00| null| 1|
| 4| 50|2019-02-02 12:00:00|2019-02-02 14:00:00| 0|
| 4| 300|2019-02-02 14:00:00| null| 1|
| 5| 500|2019-02-02 00:00:00| null| 1|
+---+------+-------------------+-------------------+------+
Expected Output:
+---+------+-------------------+-------------------+------+
| pk|amount| startDate| endDate|active|
+---+------+-------------------+-------------------+------+
| 1| 10|2019-01-01 12:00:00|2019-01-20 05:00:00| 0|
| 1| 20|2019-01-20 05:00:00|2019-02-01 07:00:00| 0|
| 1| 50|2019-02-01 07:00:00|2019-02-02 08:00:00| 0|
| 1| 75|2019-02-02 08:00:00| null| 1|
| 2| 100|2019-01-01 00:00:00|2019-02-01 05:00:00| 0|
| 2| 200|2019-02-01 05:00:00|2019-02-01 13:00:00| 0|
| 2| 60|2019-02-01 13:00:00|2019-02-01 19:00:00| 0|
| 2| 500|2019-02-01 19:00:00| null| 1|
| 3| 75|2019-01-01 06:00:00|2019-01-26 08:00:00| 0|
| 3| 750|2019-01-26 08:00:00|2019-02-01 00:00:00| 1|
| 3| 175|2019-02-01 00:00:00| null| 1|
| 4| 50|2019-02-02 12:00:00|2019-02-02 14:00:00| 0|
| 4| 300|2019-02-02 14:00:00| null| 1|
| 5| 500|2019-02-02 00:00:00| null| 1|
| 10| 40|2019-01-01 00:00:00| null| 1|
+---+------+-------------------+-------------------+------+
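To experiment with the answers below, the two input DataFrames can be recreated from the tables above. A minimal sketch, assuming an existing SparkSession named spark (the frames are named dfOld and dfNew here):

import java.sql.Timestamp
import spark.implicits._

// shorthand for the "yyyy-MM-dd HH:mm:ss" timestamps shown in the tables
def ts(s: String): Timestamp = Timestamp.valueOf(s)

val dfOld = Seq(
  (1, 10, ts("2019-01-01 12:00:00"), Some(ts("2019-01-20 05:00:00")), 0),
  (1, 20, ts("2019-01-20 05:00:00"), None, 1),
  (2, 100, ts("2019-01-01 00:00:00"), None, 1),
  (3, 75, ts("2019-01-01 06:00:00"), Some(ts("2019-01-26 08:00:00")), 0),
  (3, 750, ts("2019-01-26 08:00:00"), None, 1),
  (10, 40, ts("2019-01-01 00:00:00"), None, 1)
).toDF("pk", "amount", "startDate", "endDate", "active")

val dfNew = Seq(
  (1, 50, ts("2019-02-01 07:00:00"), Some(ts("2019-02-02 08:00:00")), 0),
  (1, 75, ts("2019-02-02 08:00:00"), None, 1),
  (2, 200, ts("2019-02-01 05:00:00"), Some(ts("2019-02-01 13:00:00")), 0),
  (2, 60, ts("2019-02-01 13:00:00"), Some(ts("2019-02-01 19:00:00")), 0),
  (2, 500, ts("2019-02-01 19:00:00"), None, 1),
  (3, 175, ts("2019-02-01 00:00:00"), None, 1),
  (4, 50, ts("2019-02-02 12:00:00"), Some(ts("2019-02-02 14:00:00")), 0),
  (4, 300, ts("2019-02-02 14:00:00"), None, 1),
  (5, 500, ts("2019-02-02 00:00:00"), None, 1)
).toDF("pk", "amount", "startDate", "endDate", "active")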
Answer 1:
You can start by selecting the earliest startDate for each pk group from the new DataFrame and joining it with the old one to update the desired columns. Then union the join result with the new DataFrame.
Something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"col" syntax

// get the earliest startDate for each pk group in the new data
val w = Window.partitionBy($"pk").orderBy($"startDate")
val updates = df_new.withColumn("rn", row_number().over(w)).filter("rn = 1").select($"pk", $"startDate")

// join with the old data and close the matching active rows; the condition is
// computed once up front so that updating endDate does not affect the active update
val joinOldNew = df_old.join(updates.alias("new"), Seq("pk"), "left")
  .withColumn("toClose", $"endDate".isNull && $"active" === lit(1) && $"new.startDate".isNotNull)
  .withColumn("endDate", when($"toClose", $"new.startDate").otherwise($"endDate"))
  .withColumn("active", when($"toClose", lit(0)).otherwise($"active"))
  .drop($"new.startDate")
  .drop("toClose")

// union the updated old data with the new data
val result = joinOldNew.union(df_new)
Answer 2:
- Union the 2 data frames.
- groupByKey on pk.
- mapGroups will provide a tuple of the key and an iterator of rows.
- On each group, sort the rows, iterate over all of them, close the records you need to close and keep the rows you want.
// union the existing and the incoming data, then process each pk group
val df = dfOld.union(dfNew)
df.groupByKey(row => row.getAs[Int]("pk"))
  .mapGroups { case (key, rows) =>
    // apply all the logic you need per pk:
    // sort the rows by startDate, keep the latest one open, close the old ones
  }
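A more complete, self-contained sketch of this idea, under a couple of assumptions: the rows are modeled with an illustrative Record case class, and flatMapGroups is used instead of mapGroups because each group has to emit several output rows.

import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}

// illustrative record type matching the columns from the question
case class Record(pk: Int, amount: Int, startDate: Timestamp,
                  endDate: Option[Timestamp], active: Int)

def mergeScd2(dfOld: Dataset[Record], dfNew: Dataset[Record])(spark: SparkSession): Dataset[Record] = {
  import spark.implicits._
  dfOld.union(dfNew)
    .groupByKey(_.pk)
    .flatMapGroups { (_, rows) =>
      // sort the full history of the key by startDate
      val sorted = rows.toSeq.sortBy(_.startDate.getTime)
      sorted.zipWithIndex.map {
        case (r, i) if i == sorted.length - 1 => r      // latest row stays open/active
        case (r, i) if r.endDate.isEmpty =>             // close previously open rows
          r.copy(endDate = Some(sorted(i + 1).startDate), active = 0)
        case (r, _) => r                                // already-closed rows are kept as-is
      }
    }
}

With the frames from the question converted to Dataset[Record] via .as[Record], calling mergeScd2(dfOld.as[Record], dfNew.as[Record])(spark) should return the merged history per key.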
Answer 3:
Thanks to the answer suggested by @blackbishop, I was able to get it working. Here's the working version (in case someone's interested):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// get the earliest startDate for each pk group in the new data
val w = Window.partitionBy("pk").orderBy("startDate")
val updates = dfNew.withColumn("rn", row_number().over(w)).filter("rn = 1").select("pk", "startDate")
// join with the old data and update the old values when there is a match
val joinOldNew = dfOld.join(updates.alias("new"), Seq("pk"), "left")
.withColumn("endDate", when(col("endDate").isNull
&& col("active") === lit(1) && col("new.startDate").isNotNull,
col("new.startDate")).otherwise(col("endDate")))
.withColumn("active", when(col("endDate").isNull, lit(1))
.otherwise(lit(0)))
.drop(col("new.startDate"))
// union with the new data (the orderBy is not necessary; it is only there to facilitate testing)
val results = joinOldNew.union(dfNew).orderBy(col("pk"), col("startDate"))
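With the sample frames built from the tables in the question, a quick way to compare the result against the expected output:

// print the full merged history without truncating the timestamp columns
results.show(truncate = false)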
Source: https://stackoverflow.com/questions/59586700/implement-scd-type-2-in-spark