Forward Fill New Row to Account for Missing Dates

心在旅途 2020-12-20 08:26

I currently have a dataset grouped into hourly increments by a variable "aggregator". There are gaps in this hourly data, and what I would ideally like to do is forward fill the missing rows.

1 Answer
  • 2020-12-20 08:27

Here is a solution that fills the missing hours using a window, lag, and a UDF. With a little modification it can be extended to fill missing days as well.

    from pyspark.sql.window import Window
    from pyspark.sql.types import ArrayType, TimestampType
    from pyspark.sql.functions import col, explode, lag, udf
    from datetime import timedelta
    
    def missing_hours(t1, t2):
        # whole hours strictly between t2 and t1; using the full timestamp
        # difference (not the .hour fields) keeps this correct across midnight
        gap = int((t1 - t2).total_seconds() // 3600)
        return [t1 - timedelta(hours=x) for x in range(1, gap)]
    
    missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))
    
    df = spark.read.csv('dates.csv', header=True, inferSchema=True)
    
    window = Window.partitionBy("aggregator").orderBy("timestamp")
    
    # pair each row with the previous timestamp in its group, then explode
    # the hours missing between the two into new rows
    df_missing = df.withColumn("prev_timestamp", lag(col("timestamp"), 1, None).over(window))\
           .filter(col("prev_timestamp").isNotNull())\
           .withColumn("timestamp", explode(missing_hours_udf(col("timestamp"), col("prev_timestamp"))))\
           .drop("prev_timestamp")
    
    df.union(df_missing).orderBy("aggregator", "timestamp").show()
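
    Gaps that cross midnight are the tricky case: the number of missing hours has to come from the full timestamp difference, not from the hour fields alone. A standalone check of that logic (plain Python, no Spark session needed; a sketch, not part of the answer's Spark code):

    ```python
    from datetime import datetime, timedelta
    
    def missing_hours(t1, t2):
        # whole hours strictly between t2 and t1, robust across day boundaries
        gap = int((t1 - t2).total_seconds() // 3600)
        return [t1 - timedelta(hours=x) for x in range(1, gap)]
    
    # a gap from 23:00 to 01:00 the next day has exactly one missing hour
    print(missing_hours(datetime(2018, 12, 28, 1), datetime(2018, 12, 27, 23)))
    # [datetime.datetime(2018, 12, 28, 0, 0)]
    ```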
    

    which results in:

    +-------------------+----------+
    |          timestamp|aggregator|
    +-------------------+----------+
    |2018-12-27 09:00:00|         A|
    |2018-12-27 10:00:00|         A|
    |2018-12-27 11:00:00|         A|
    |2018-12-27 12:00:00|         A|
    |2018-12-27 13:00:00|         A|
    |2018-12-27 09:00:00|         B|
    |2018-12-27 10:00:00|         B|
    |2018-12-27 11:00:00|         B|
    |2018-12-27 12:00:00|         B|
    |2018-12-27 13:00:00|         B|
    |2018-12-27 14:00:00|         B|
    +-------------------+----------+
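
    The same effect can be sketched end to end without Spark: per group, walk the sorted timestamps and emit every whole hour between consecutive rows. This is an illustrative stand-alone helper (hypothetical names, not part of the answer's code):

    ```python
    from datetime import datetime, timedelta
    
    def fill_hourly_gaps(rows):
        """rows: list of (timestamp, aggregator) tuples; returns the rows
        with every missing whole hour inside each group filled in."""
        by_group = {}
        for ts, agg in rows:
            by_group.setdefault(agg, []).append(ts)
        filled = []
        for agg, stamps in by_group.items():
            stamps.sort()
            out = [stamps[0]]
            for ts in stamps[1:]:
                cur = out[-1] + timedelta(hours=1)
                while cur < ts:          # emit each missing hour
                    out.append(cur)
                    cur += timedelta(hours=1)
                out.append(ts)
            filled.extend((t, agg) for t in out)
        # mirror the answer's orderBy("aggregator", "timestamp")
        return sorted(filled, key=lambda r: (r[1], r[0]))
    
    rows = [(datetime(2018, 12, 27, 9), "A"), (datetime(2018, 12, 27, 13), "A")]
    print(len(fill_hourly_gaps(rows)))  # 5 rows: 09:00 through 13:00
    ```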
    