Problems with adding a new column to a dataframe - spark/scala

予麋鹿 2021-01-16 17:53

I am new to Spark/Scala. I am trying to read some data from a Hive table into a Spark dataframe and then add a column based on some condition. Here is my code:



        
1 answer
  •  南笙 (original poster)
    2021-01-16 18:37

    You can simply use the built-in datediff function to get the difference in days between two date columns; you don't need to write your own function or a UDF. The when expression is also changed from yours:

    import org.apache.spark.sql.functions._

    // Conditions shared by the branches; the data may hold the string "null" as well as real NULLs
    val hasDueDate  = col("item_due_date").isNotNull && (lower(col("item_due_date")) =!= "null")
    val hasDecision = col("item_decision").isNotNull && (lower(col("item_decision")) =!= "null")
    val isPastDue   = (col("past_due") === 1) && hasDueDate
    // datediff(end, start): negative while partition_date is before item_due_date
    val daysOverdue = datediff(col("partition_date"), col("item_due_date"))

    val finalDF = DF.withColumn("status",
      when(isPastDue && (daysOverdue < 0) && hasDecision, "approved")
        .when(isPastDue && (daysOverdue < 0) && !hasDecision, "pending")
        .when(isPastDue && (daysOverdue >= 0), "expired")
        .otherwise("null")) // the string "null", not a SQL NULL
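
    The sign convention of datediff is easy to get backwards: datediff(end, start) counts the days from start to end, so it is negative while partition_date is still before item_due_date. A minimal plain-Scala sketch of the same arithmetic, using java.time rather than Spark (dayDiff is a hypothetical helper for illustration, not a Spark function):

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical stand-in for Spark's datediff(end, start):
// the number of days from `start` to `end`, negative when `end` is earlier.
def dayDiff(end: String, start: String): Long =
  ChronoUnit.DAYS.between(LocalDate.parse(start), LocalDate.parse(end))

// partition_date 2017-11-22 is before item_due_date 2017-12-14: not yet due
val notYetDue: Long = dayDiff("2017-11-22", "2017-12-14") // -22

// partition_date 2017-11-22 is after item_due_date 2016-11-30: overdue
val overdue: Long = dayDiff("2017-11-22", "2016-11-30") // 357
```

    A negative value therefore leads to the "approved" or "pending" branches, and zero or a positive value leads to "expired".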
    

    This logic converts the dataframe

    +--------+-------------+-------------+--------------+
    |past_due|item_due_date|item_decision|partition_date|
    +--------+-------------+-------------+--------------+
    |1       |2017-12-14   |null         |2017-11-22    |
    |1       |2017-12-14   |Mitigate     |2017-11-22    |
    |1       |0001-01-14   |Mitigate     |2017-11-22    |
    |1       |0001-01-14   |Mitigate     |2017-11-22    |
    |0       |2018-03-18   |null         |2017-11-22    |
    |1       |2016-11-30   |null         |2017-11-22    |
    +--------+-------------+-------------+--------------+
    

    into the following, with the status column added:

    +--------+-------------+-------------+--------------+--------+
    |past_due|item_due_date|item_decision|partition_date|status  |
    +--------+-------------+-------------+--------------+--------+
    |1       |2017-12-14   |null         |2017-11-22    |pending |
    |1       |2017-12-14   |Mitigate     |2017-11-22    |approved|
    |1       |0001-01-14   |Mitigate     |2017-11-22    |expired |
    |1       |0001-01-14   |Mitigate     |2017-11-22    |expired |
    |0       |2018-03-18   |null         |2017-11-22    |null    |
    |1       |2016-11-30   |null         |2017-11-22    |expired |
    +--------+-------------+-------------+--------------+--------+
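
    To try this without the Hive table, the input dataframe can be rebuilt by hand. This is only a sketch under some assumptions: Spark SQL is on the classpath, a local session is acceptable, the dates are stored as yyyy-MM-dd strings, and the null entries in the table are real SQL NULLs rather than the string "null". The object name SampleInput and the method build are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SampleInput {
  // Rows copied from the input table above; None becomes a SQL NULL (an assumption).
  val rows: Seq[(Int, String, Option[String], String)] = Seq(
    (1, "2017-12-14", None,             "2017-11-22"),
    (1, "2017-12-14", Some("Mitigate"), "2017-11-22"),
    (1, "0001-01-14", Some("Mitigate"), "2017-11-22"),
    (1, "0001-01-14", Some("Mitigate"), "2017-11-22"),
    (0, "2018-03-18", None,             "2017-11-22"),
    (1, "2016-11-30", None,             "2017-11-22")
  )

  // Builds the dataframe with the same column names the answer's code expects.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    rows.toDF("past_due", "item_due_date", "item_decision", "partition_date")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; the app name is arbitrary.
    val spark = SparkSession.builder().master("local[*]").appName("sample-input").getOrCreate()
    val DF = build(spark)
    DF.show(false)
    spark.stop()
  }
}
```

    With DF defined this way, the withColumn expression from the answer can be applied to it directly.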
    

    I hope the answer is helpful.
