How to use window specification and join condition per column values?

长情又很酷 2021-01-07 12:02

Here is my DF1

OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber|^|FFAction
4295858898|^|204|^|205|^|         


        
1 Answer
  • 2021-01-07 12:48

    DISCLAIMER: This question and the other one I've just answered look like duplicates, so either one of them will be marked as such soon, or we will find out the difference between them and this disclaimer will go away. Time will tell.


    Given the requirement to select the final window specification and join condition based on the values of the FFAction_1 column, I'd filter first and then decide which window aggregation and join to use.

    val df1 = spark.
      read.
      option("header", true).
      option("sep", "|").
      csv("df1.csv").
      select("OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber", "FFAction")
    scala> df1.show
    +--------------+--------------+---------------+-------------+--------+
    |OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber|FFAction|
    +--------------+--------------+---------------+-------------+--------+
    |    4295858898|           204|            205|            1|       I|
    |    4295858898|           204|            208|            2|       I|
    |    4295858898|           204|            209|            2|       I|
    |    4295858898|           204|            211|            3|       I|
    |    4295858898|           204|            212|            3|       I|
    |    4295858898|           204|            214|            4|       I|
    |    4295858898|           204|            215|            4|       I|
    |    4295858898|           206|            207|            1|       I|
    |    4295858898|           206|            210|            2|       I|
    |    4295858898|           206|            213|            3|       I|
    +--------------+--------------+---------------+-------------+--------+
    

    The right-hand side of the join is fairly similar in "shape".

    val df2 = spark.
      read.
      option("header", true).
      option("sep", "|").
      csv("df2.csv").
      select("DataPartition_1", "PartitionYear_1", "TimeStamp", "OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber_1", "FFAction_1")
    scala> df2.show
    +-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
    |  DataPartition_1|PartitionYear_1|    TimeStamp|OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber_1|FFAction_1|
    +-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
    |SelfSourcedPublic|           2002|1510725106270|    4295858941|            24|             25|              4|         O|
    |SelfSourcedPublic|           2002|1510725106271|    4295858941|            24|             25|              5|         O|
    |SelfSourcedPublic|           2003|1510725106272|    4295858941|            30|             31|              2|         O|
    |SelfSourcedPublic|           2003|1510725106273|    4295858941|            30|             31|              3|         O|
    |SelfSourcedPublic|           2001|1510725106293|    4295858941|             5|             20|              2|         O|
    |SelfSourcedPublic|           2001|1510725106294|    4295858941|             5|             21|              3|         O|
    |SelfSourcedPublic|           2002|1510725106295|    4295858941|             1|             22|              4|         O|
    |SelfSourcedPublic|           2002|1510725106296|    4295858941|             1|             23|              5|         O|
    |SelfSourcedPublic|           2016|1510725106297|    4295858941|            35|             36|              1|         I|
    |SelfSourcedPublic|           2016|1510725106297|    4295858941|            35|             36|              1|         D|
    +-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
    

    With the above datasets, I'd check whether there is at least one I in the FFAction_1 column of df2 and select the window specification and join condition accordingly.

    The trick is to use the join operator followed by the where (or filter) operator, so you can decide which join condition to use.

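    import org.apache.spark.sql.functions.rank

    // true when df2 has no row with FFAction_1 = "I"
    val noIs = df2.filter($"FFAction_1" === "I").take(1).isEmpty
    // windowSpecForOs/Is and joinForOs/Is are the two candidate window
    // specifications and join conditions; they are not defined in this answer
    val (windowSpec, joinCond) = if (noIs) {
      (windowSpecForOs, joinForOs)
    } else {
      (windowSpecForIs, joinForIs)
    }
    // df2result and df1resultFinalWithYear are not defined in this answer either
    val latestForEachKey = df2result.withColumn("rank", rank() over windowSpec)
    val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey).where(joinCond)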
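
    The snippet above leaves windowSpecForOs, windowSpecForIs, joinForOs and joinForIs undefined. The following is a minimal, self-contained sketch of how they could look. It is not the asker's actual logic: the partitioning keys, the join keys, the rank filter, the aliases l and r, and the toy left and right DataFrames are all assumptions made for illustration only.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, rank}

    val spark = SparkSession.builder().master("local[*]").appName("window-and-join-sketch").getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the two sides of the join (hypothetical data and names).
    val left = Seq(
      ("4295858898", "204", "205"),
      ("4295858941", "35", "36")
    ).toDF("OrganizationId", "AnnualPeriodId", "InterimPeriodId")

    val right = Seq(
      ("4295858941", "35", "36", 1510725106297L, "I"),
      ("4295858941", "35", "36", 1510725106290L, "O"),
      ("4295858941", "24", "25", 1510725106270L, "O")
    ).toDF("OrganizationId", "AnnualPeriodId", "InterimPeriodId", "TimeStamp", "FFAction_1")

    // Assumption: the "O" variant keys on organization and annual period only,
    // while the "I" variant also includes the interim period.
    val windowSpecForOs = Window.
      partitionBy("OrganizationId", "AnnualPeriodId").
      orderBy(col("TimeStamp").desc)
    val windowSpecForIs = Window.
      partitionBy("OrganizationId", "AnnualPeriodId", "InterimPeriodId").
      orderBy(col("TimeStamp").desc)

    // Join conditions are plain Column values, so they can be built up front
    // and picked at runtime. "l" and "r" are the aliases applied before the join.
    val joinForOs = col("l.OrganizationId") === col("r.OrganizationId") &&
      col("l.AnnualPeriodId") === col("r.AnnualPeriodId")
    val joinForIs = joinForOs && col("l.InterimPeriodId") === col("r.InterimPeriodId")

    // Pick the pair based on whether any "I" row exists on the right-hand side.
    val noIs = right.filter(col("FFAction_1") === "I").take(1).isEmpty
    val (windowSpec, joinCond) =
      if (noIs) (windowSpecForOs, joinForOs) else (windowSpecForIs, joinForIs)

    // Assumption: only the most recent row per key (rank 1) is kept.
    val latestForEachKey = right.
      withColumn("rank", rank().over(windowSpec)).
      filter(col("rank") === 1)

    // A condition-less join followed by where, so the condition is chosen at runtime.
    val result = left.as("l").join(latestForEachKey.as("r")).where(joinCond)
    result.show()
    ```

    With this toy data the right-hand side contains an I row, so noIs is false and the I variants are selected. Because both the window specification and the join condition are ordinary values, the branching stays in one place and the rest of the pipeline does not change.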
    