Generating monthly timestamps between two dates in pyspark dataframe

粉色の甜心 2021-01-06 12:29

I have a DataFrame with a "date" column, and I'm trying to generate a new DataFrame with all monthly timestamps between the min and max date from the ...

1 Answer
  • 2021-01-06 12:34

    Suppose you had the following DataFrame:

    data = [("2000-01-01","2002-12-01")]
    df = spark.createDataFrame(data, ["minDate", "maxDate"])
    df.show()
    #+----------+----------+
    #|   minDate|   maxDate|
    #+----------+----------+
    #|2000-01-01|2002-12-01|
    #+----------+----------+
    

    You can add a column named date containing every month between minDate and maxDate by following the same approach as my answer to this question.

    Just replace pyspark.sql.functions.datediff with pyspark.sql.functions.months_between, and use add_months instead of date_add:

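    import pyspark.sql.functions as f
    
    # months_between returns a double, so cast it to int before passing it to repeat().
    # repeat(',', n) split on ',' yields n + 1 empty strings; posexplode then gives
    # positions 0..n, which add_months turns into one date per month.
    df.withColumn("monthsDiff", f.months_between("maxDate", "minDate").cast("int"))\
        .withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
        .select("*", f.posexplode("repeat").alias("date", "val"))\
        .withColumn("date", f.expr("add_months(minDate, date)"))\
        .select('date')\
        .show(n=50)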
    #+----------+
    #|      date|
    #+----------+
    #|2000-01-01|
    #|2000-02-01|
    #|2000-03-01|
    #|2000-04-01|
    # ...skipping some rows...
    #|2002-10-01|
    #|2002-11-01|
    #|2002-12-01|
    #+----------+
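
To make the logic of that pipeline easier to follow, here is a minimal plain-Python sketch of the same two steps: count the whole months between the two dates, then add 0, 1, ..., n months to the start date. It does not use Spark at all. The helper names `add_months` and `monthly_dates` are hypothetical, and the month count is only meant to match Spark for first-of-month inputs like the example above.

```python
import calendar
from datetime import date
from typing import List


def add_months(start: date, n: int) -> date:
    """Shift `start` forward by n calendar months, clamping the day to the month's length."""
    total = start.year * 12 + (start.month - 1) + n
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def monthly_dates(min_date: date, max_date: date) -> List[date]:
    """Return min_date plus 0..n months, where n is the whole-month gap (an assumption
    of this sketch; Spark's months_between has extra rules for end-of-month dates)."""
    months_diff = (max_date.year - min_date.year) * 12 + (max_date.month - min_date.month)
    if max_date.day < min_date.day:
        months_diff -= 1  # the last month is not complete yet
    # Positions 0..months_diff play the role of the posexplode index in the Spark version.
    return [add_months(min_date, i) for i in range(months_diff + 1)]


dates = monthly_dates(date(2000, 1, 1), date(2002, 12, 1))
print(len(dates), dates[0], dates[-1])
```

For the example range this produces 36 dates, from 2000-01-01 through 2002-12-01, the same rows as the Spark output above.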
    