Filling missing dates in spark dataframe column

前端 未结 2 336
青春惊慌失措
青春惊慌失措 2021-01-07 06:07

I\'ve a spark data frame with columns - \"date\" of type timestamp and \"quantity\" of type long. For each date, I\'ve some value for quantity. The

相关标签:
2条回答
  • 2021-01-07 06:50

    I have written this answer in bit verbose way for easy understanding of the code. It can be optimized.

    Needed imports

    import java.time.format.DateTimeFormatter
    import java.time.{LocalDate, LocalDateTime}
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{LongType, TimestampType}
    

    UDFs for String to Valid date format

     val date_transform = udf((date: String) => {
        val dtFormatter = DateTimeFormatter.ofPattern("d-M-y")
        val dt = LocalDate.parse(date, dtFormatter)
        "%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
          .replaceAll(" ", "0")
      })
    

    Below UDF code taken from Iterate over dates range

      def fill_dates = udf((start: String, excludedDiff: Int) => {
        val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
        val fromDt = LocalDateTime.parse(start, dtFormatter)
        (1 to (excludedDiff - 1)).map(day => {
          val dt = fromDt.plusDays(day)
          "%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
            .replaceAll(" ", "0")
        })
      })
    

    Setting up sample dataframe (df)

    val df = Seq(
          ("10-09-2016", 1),
          ("11-09-2016", 2),
          ("14-09-2016", 0),
          ("16-09-2016", 1),
          ("17-09-2016", 0),
          ("20-09-2016", 2)).toDF("date", "quantity")
          .withColumn("date", date_transform($"date").cast(TimestampType))
          .withColumn("quantity", $"quantity".cast(LongType))
    
    df.printSchema()
    root
     |-- date: timestamp (nullable = true)
     |-- quantity: long (nullable = false)
    
    
    df.show()    
    +-------------------+--------+
    |               date|quantity|
    +-------------------+--------+
    |2016-09-10 00:00:00|       1|
    |2016-09-11 00:00:00|       2|
    |2016-09-14 00:00:00|       0|
    |2016-09-16 00:00:00|       1|
    |2016-09-17 00:00:00|       0|
    |2016-09-20 00:00:00|       2|
    +-------------------+--------+
    

    Create a temporary dataframe(tempDf) to union with df:

    val w = Window.orderBy($"date")
    val tempDf = df.withColumn("diff", datediff(lead($"date", 1).over(w), $"date"))
      .filter($"diff" > 1) // Pick date diff more than one day to generate our date
      .withColumn("next_dates", fill_dates($"date", $"diff"))
      .withColumn("quantity", lit("0"))
      .withColumn("date", explode($"next_dates"))
      .withColumn("date", $"date".cast(TimestampType))
    
    tempDf.show(false)
    +-------------------+--------+----+------------------------+
    |date               |quantity|diff|next_dates              |
    +-------------------+--------+----+------------------------+
    |2016-09-12 00:00:00|0       |3   |[2016-09-12, 2016-09-13]|
    |2016-09-13 00:00:00|0       |3   |[2016-09-12, 2016-09-13]|
    |2016-09-15 00:00:00|0       |2   |[2016-09-15]            |
    |2016-09-18 00:00:00|0       |3   |[2016-09-18, 2016-09-19]|
    |2016-09-19 00:00:00|0       |3   |[2016-09-18, 2016-09-19]|
    +-------------------+--------+----+------------------------+
    

    Now union two dataframes

    val result = df.union(tempDf.select("date", "quantity"))
      .orderBy("date")
    
    result.show()
    +-------------------+--------+
    |               date|quantity|
    +-------------------+--------+
    |2016-09-10 00:00:00|       1|
    |2016-09-11 00:00:00|       2|
    |2016-09-12 00:00:00|       0|
    |2016-09-13 00:00:00|       0|
    |2016-09-14 00:00:00|       0|
    |2016-09-15 00:00:00|       0|
    |2016-09-16 00:00:00|       1|
    |2016-09-17 00:00:00|       0|
    |2016-09-18 00:00:00|       0|
    |2016-09-19 00:00:00|       0|
    |2016-09-20 00:00:00|       2|
    +-------------------+--------+
    
    0 讨论(0)
  • 2021-01-07 06:55

    Based on the @mrsrinivas excelent answer, here is the PySpark version.

    Needed imports

    from typing import List
    import datetime
    from pyspark.sql import DataFrame, Window
    from pyspark.sql.functions import col, lit, udf, datediff, lead, explode
    from pyspark.sql.types import DateType, ArrayType
    

    UDF to create the range of next dates

    def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
        return [start_date + datetime.timedelta(days=days) for days in range(1, diff)]
    

    Function the create the DateFrame filling the dates (support "grouping" columns):

    def _get_fill_dates_df(df: DataFrame, date_column: str, group_columns: List[str], fill_column: str) -> DataFrame:
        get_next_dates_udf = udf(_get_next_dates, ArrayType(DateType()))
    
        window = Window.orderBy(*group_columns, date_column)
    
        return df.withColumn("_diff", datediff(lead(date_column, 1).over(window), date_column)) \
            .filter(col("_diff") > 1).withColumn("_next_dates", get_next_dates_udf(date_column, "_diff")) \
            .withColumn(fill_column, lit("0")).withColumn(date_column, explode("_next_dates")) \
            .drop("_diff", "_next_dates")
    

    The usage of the function:

    fill_df = _get_fill_dates_df(df, "Date", [], "Quantity")
    df = df.union(fill_df)
    

    It assumes that the date column is already in date type.

    0 讨论(0)
提交回复
热议问题