How to pick the latest record in a Spark Structured Streaming join

暖寄归人 2021-01-16 10:06

I am using spark-sql 2.4.x with datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.

I have rates meta data of currenc

1 answer
  • 2021-01-16 10:24

    Replace the last part of your code with the code below. It performs a left join and computes the date difference between calc_date and rate_date. A window function then picks the nearest rate date for each row, and prev_sales is calculated with the same formula you used.

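    Note that I added the filter condition filter(col("diff") >= 0), which handles the case calc_date < rate_date by excluding rates dated after calc_date. Rows with no matching rate have a null diff and are removed by this filter too, so the left join effectively behaves like an inner join. I also added a few more records to illustrate this case.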
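
    The selection rule can also be sketched with plain Scala collections, without Spark, which makes the logic easy to check in isolation. This is only an illustration under assumptions: the names Rate, NearestRate and latestRateOnOrBefore are hypothetical and not part of the original code, and the rates are the sample rows used in this answer.

    ```scala
    import java.time.LocalDate
    import java.time.temporal.ChronoUnit

    // Hypothetical type for illustration; not part of the original answer.
    final case class Rate(baseCode: String, rateDate: LocalDate, rateValue: BigDecimal)

    object NearestRate {
      // Same sample rates as the ratesMetaDataDf rows in this answer.
      val rates: Seq[Rate] = Seq(
        Rate("EUR", LocalDate.parse("2019-05-10"), BigDecimal("1.130657")),
        Rate("EUR", LocalDate.parse("2019-05-09"), BigDecimal("1.12088")),
        Rate("EUR", LocalDate.parse("2019-12-20"), BigDecimal("1.1584"))
      )

      // Days from rateDate to calcDate, i.e. the same sign convention as
      // datediff(calc_date, rate_date).
      private def diff(r: Rate, calcDate: LocalDate): Long =
        ChronoUnit.DAYS.between(r.rateDate, calcDate)

      // Keep rates with diff >= 0 and return the one with the smallest diff,
      // which corresponds to row_number() == 1 when ordering by diff.
      def latestRateOnOrBefore(code: String, calcDate: LocalDate): Option[Rate] = {
        val candidates = rates.filter(r => r.baseCode == code && diff(r, calcDate) >= 0)
        if (candidates.isEmpty) None
        else Some(candidates.minBy(r => diff(r, calcDate)))
      }
    }
    ```

    A calc_date earlier than every rate_date yields None here, which corresponds to the row being dropped by the filter in the Spark version.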

    scala> ratesMetaDataDf.show
    +---------+----------+----------+-----------+
    |base_code| rate_date|rate_value|target_code|
    +---------+----------+----------+-----------+
    |      EUR|2019-05-10|  1.130657|        USD|
    |      EUR|2019-05-09|   1.12088|        USD|
    |      EUR|2019-12-20|    1.1584|        USD|
    +---------+----------+----------+-----------+
    
    
    scala> kafkaDf.show
    +---------+----+-------+-----+----+----------+------+----------+
    |companyId|year|quarter|sales|code| calc_date|c_code|prev_sales|
    +---------+----+-------+-----+----+----------+------+----------+
    |       15|2016|      4|100.5| USD|2021-01-20|   EUR|     221.4|
    |       15|2016|      4|100.5| USD|2019-06-20|   EUR|     221.4|
    +---------+----+-------+-----+----+----------+------+----------+
    
    
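    scala> import org.apache.spark.sql.expressions.Window   // Window is not imported by default in spark-shell

    scala> // One partition per Kafka row; within it, the smallest diff (nearest earlier rate) sorts first.
    scala> val W = Window.partitionBy("companyId","year","quarter","sales","code","calc_date","c_code","prev_sales").orderBy(col("diff"))
    
    scala> val rateJoinResultDf = kafkaDf.alias("k").join(ratesMetaDataDf.alias("r"), col("k.c_code") === col("r.base_code"), "left")
                                             .withColumn("diff", datediff(col("calc_date"), col("rate_date")))   // days from rate_date to calc_date
                                             .filter(col("diff") >= 0)                                           // drop rates dated after calc_date
                                             .withColumn("closedate", row_number().over(W))                      // rank rates by closeness
                                             .filter(col("closedate") === 1)                                     // keep only the nearest rate
                                             .drop("diff", "closedate")
                                             .withColumn("prev_sales", (col("prev_sales") * col("rate_value")).cast("Decimal(14,5)"))
                                             .select("companyId", "year","quarter","sales","code","calc_date","prev_sales")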
    
    scala> rateJoinResultDf.show
    +---------+----+-------+-----+----+----------+----------+
    |companyId|year|quarter|sales|code| calc_date|prev_sales|
    +---------+----+-------+-----+----+----------+----------+
    |       15|2016|      4|100.5| USD|2021-01-20| 256.46976|
    |       15|2016|      4|100.5| USD|2019-06-20| 250.32746|
    +---------+----+-------+-----+----+----------+----------+ 
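
    The prev_sales values in the output can be checked by hand with a small plain-Scala sketch. The helper name convert is hypothetical, and the HALF_UP rounding is an assumption about how the cast to Decimal(14,5) rounds, not something stated in the original answer.

    ```scala
    import scala.math.BigDecimal.RoundingMode

    object PrevSalesCheck {
      // Hypothetical helper: multiply prev_sales by the chosen rate and keep
      // 5 decimal places, assuming HALF_UP rounding.
      def convert(prevSales: BigDecimal, rateValue: BigDecimal): BigDecimal =
        (prevSales * rateValue).setScale(5, RoundingMode.HALF_UP)
    }
    ```

    With the sample data, 221.4 is multiplied by 1.1584 for calc_date 2021-01-20 and by 1.130657 for calc_date 2019-06-20, the nearest earlier rates in each case.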
    