PySpark - how to backfill a DataFrame?


How can you do the same thing as df.fillna(method='bfill') on a pandas DataFrame with a pyspark.sql.DataFrame?


2 Answers
  •  别那么骄傲
    2021-01-02 23:28

    Backfilling a distributed dataset is not as easy a task as it is in a (local) pandas DataFrame: you cannot be sure that the value to fill with exists in the same partition. I would use crossJoin with a window function, for example for this DF:

    df = spark.createDataFrame([
        ('2017-01-01', None), 
        ('2017-01-02', 'B'), 
        ('2017-01-03', None), 
        ('2017-01-04', None), 
        ('2017-01-05', 'E'), 
        ('2017-01-06', None), 
        ('2017-01-07', 'G')], ['date', 'value'])
    df.show()
    
    +----------+-----+
    |      date|value|
    +----------+-----+
    |2017-01-01| null|
    |2017-01-02|    B|
    |2017-01-03| null|
    |2017-01-04| null|
    |2017-01-05|    E|
    |2017-01-06| null|
    |2017-01-07|    G|
    +----------+-----+
    

    The code would be:

    from pyspark.sql.functions import coalesce, col, row_number
    from pyspark.sql.window import Window
    
    # Pair each row (a) with every row at or after it (b), dropping pairs
    # where both values are null. The earliest b per a.date then carries
    # the nearest non-null value, and coalesce keeps a's own value when set.
    df.alias('a').crossJoin(df.alias('b')) \
        .where((col('b.date') >= col('a.date')) & (col('a.value').isNotNull() | col('b.value').isNotNull())) \
        .withColumn('rn', row_number().over(Window.partitionBy('a.date').orderBy('b.date'))) \
        .where(col('rn') == 1) \
        .select('a.date', coalesce('a.value', 'b.value').alias('value')) \
        .orderBy('a.date') \
        .show()
    
    +----------+-----+
    |      date|value|
    +----------+-----+
    |2017-01-01|    B|
    |2017-01-02|    B|
    |2017-01-03|    E|
    |2017-01-04|    E|
    |2017-01-05|    E|
    |2017-01-06|    G|
    |2017-01-07|    G|
    +----------+-----+
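
    On recent Spark versions the same backfill can also be written without the quadratic crossJoin, by taking the first non-null value over a forward-looking window. A minimal sketch, assuming a single ordering column and no grouping key (so the whole frame becomes one window partition):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    
    # No partitionBy: the entire ordered frame is one window partition.
    # From each row, look forward to the end of the frame and take the
    # first non-null value, i.e. the nearest value at or after that date.
    w = Window.orderBy('date').rowsBetween(Window.currentRow, Window.unboundedFollowing)
    df.withColumn('value', F.first('value', ignorenulls=True).over(w)) \
        .orderBy('date') \
        .show()

    Window.orderBy without partitionBy pulls all rows into a single task, so on large data you would add partitionBy on a grouping column to keep the work distributed.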
    
