PySpark: Interpolation of missing values in a PySpark DataFrame

前端未结


有刺的猬

I am trying to use Spark to clean a time series dataset that is fairly large and not fully populated.

What I would like to do is convert the following dataset as su


2 Answers

广开言路

2021-01-03 11:54

After a chat with @ndricca, I've updated the code with @leo's suggestions.

First, the DataFrame creation:

from pyspark.sql import functions as F
from pyspark.sql import Window

data = [
    ("A","01-01-2018",1),
    ("A","01-02-2018",2),
    ("A","01-03-2018",None),
    ("A","01-04-2018",None),
    ("A","01-05-2018",5),
    ("A","01-06-2018",None),
    ("A","01-07-2018",10),
    ("A","01-08-2018",11)
]
df = spark.createDataFrame(data,['Group','TS','Value'])
df = df.withColumn('TS',F.unix_timestamp('TS','MM-dd-yyyy').cast('timestamp'))

Next the updated function:

def fill_linear_interpolation(df,id_cols,order_col,value_col):
    """
    Apply linear interpolation to dataframe to fill gaps.

    :param df: spark dataframe
    :param id_cols: string or list of column names to partition by the window function
    :param order_col: column to use to order by the window function
    :param value_col: column to be filled

    :returns: spark dataframe updated with interpolated values
    """
    # create row number over window and a column with row number only for non missing values

    w = Window.partitionBy(id_cols).orderBy(order_col)
    new_df = df.withColumn('rn',F.row_number().over(w))
    new_df = new_df.withColumn('rn_not_null',F.when(F.col(value_col).isNotNull(),F.col('rn')))

    # create relative references to the start value (last value not missing)
    w_start = Window.partitionBy(id_cols).orderBy(order_col).rowsBetween(Window.unboundedPreceding,-1)
    new_df = new_df.withColumn('start_val',F.last(value_col,True).over(w_start))
    new_df = new_df.withColumn('start_rn',F.last('rn_not_null',True).over(w_start))

    # create relative references to the end value (first value not missing)
    w_end = Window.partitionBy(id_cols).orderBy(order_col).rowsBetween(0,Window.unboundedFollowing)
    new_df = new_df.withColumn('end_val',F.first(value_col,True).over(w_end))
    new_df = new_df.withColumn('end_rn',F.first('rn_not_null',True).over(w_end))

    if not isinstance(id_cols, list):
        id_cols = [id_cols]

    # create references to gap length and current gap position
    new_df = new_df.withColumn('diff_rn',F.col('end_rn')-F.col('start_rn'))
    new_df = new_df.withColumn('curr_rn',F.col('diff_rn')-(F.col('end_rn')-F.col('rn')))

    # calculate linear interpolation value
    lin_interp_func = (F.col('start_val')+(F.col('end_val')-F.col('start_val'))/F.col('diff_rn')*F.col('curr_rn'))
    new_df = new_df.withColumn(value_col,F.when(F.col(value_col).isNull(),lin_interp_func).otherwise(F.col(value_col)))

    new_df = new_df.drop('rn', 'rn_not_null', 'start_val', 'end_val', 'start_rn', 'end_rn', 'diff_rn', 'curr_rn')
    return new_df

Then function execution on our DataFrame:

new_df = fill_linear_interpolation(df=df,id_cols='Group',order_col='TS',value_col='Value')

I also checked it on the DataFrame from my own post; there, you have to create an additional group column first.
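To make the arithmetic behind the window columns easier to follow, here is a minimal plain-Python sketch of the same idea. It is not part of the answer's code: `interpolate_by_position` is a hypothetical helper name, and it assumes the rows are already sorted within one group. Like the `rn`-based columns above, it treats rows as evenly spaced and uses the list index in place of `rn`.

```python
def interpolate_by_position(values):
    """Fill None entries linearly between the nearest non-null neighbours.

    Rows are treated as evenly spaced: the list index plays the role of
    the `rn` column in the PySpark function (illustrative sketch only).
    """
    filled = list(values)
    n = len(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # index of the last non-null value before i (start_rn / start_val)
        start = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
        # index of the first non-null value after i (end_rn / end_val)
        end = next((j for j in range(i + 1, n) if values[j] is not None), None)
        if start is None or end is None:
            # a gap at the start or end has no neighbour on one side,
            # so it stays None (the Spark expression yields null here too)
            continue
        slope = (values[end] - values[start]) / (end - start)  # diff of values / diff_rn
        filled[i] = values[start] + slope * (i - start)        # start_val + slope * curr_rn
    return filled


print(interpolate_by_position([1, 2, None, None, 5, None, 10, 11]))
# [1, 2, 3.0, 4.0, 5, 7.5, 10, 11]
```

Gaps at the very start or end of a group are left unfilled, because one of the two reference values is missing.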


囚心锁ツ

2021-01-03 12:18

I have implemented a solution that works for Spark 2.2, based mainly on window functions. I hope it can still help someone!

First, let's recreate the dataframe:

from pyspark.sql import functions as F
from pyspark.sql import Window

data = [
    ("A","01-01-2018",1),
    ("A","01-02-2018",2),
    ("A","01-03-2018",None),
    ("A","01-04-2018",None),
    ("A","01-05-2018",5),
    ("A","01-06-2018",None),
    ("A","01-07-2018",10),
    ("A","01-08-2018",11)
]
df = spark.createDataFrame(data,['Group','TS','Value'])
df = df.withColumn('TS',F.unix_timestamp('TS','MM-dd-yyyy').cast('timestamp'))

Now, the function:

def fill_linear_interpolation(df,id_cols,order_col,value_col):
    """ 
    Apply linear interpolation to dataframe to fill gaps. 

    :param df: spark dataframe
    :param id_cols: string or list of column names to partition by the window function 
    :param order_col: column to use to order by the window function
    :param value_col: column to be filled

    :returns: spark dataframe updated with interpolated values
    """
    # create row number over window and a column with row number only for non missing values
    w = Window.partitionBy(id_cols).orderBy(order_col)
    new_df = df.withColumn('rn',F.row_number().over(w))
    new_df = new_df.withColumn('rn_not_null',F.when(F.col(value_col).isNotNull(),F.col('rn')))

    # create relative references to the start value (last value not missing)
    w_start = Window.partitionBy(id_cols).orderBy(order_col).rowsBetween(Window.unboundedPreceding,-1)
    new_df = new_df.withColumn('start_val',F.last(value_col,True).over(w_start))
    new_df = new_df.withColumn('start_rn',F.last('rn_not_null',True).over(w_start))

    # create relative references to the end value (first value not missing)
    w_end = Window.partitionBy(id_cols).orderBy(order_col).rowsBetween(0,Window.unboundedFollowing)
    new_df = new_df.withColumn('end_val',F.first(value_col,True).over(w_end))
    new_df = new_df.withColumn('end_rn',F.first('rn_not_null',True).over(w_end))

    # create references to gap length and current gap position  
    new_df = new_df.withColumn('diff_rn',F.col('end_rn')-F.col('start_rn'))
    new_df = new_df.withColumn('curr_rn',F.col('diff_rn')-(F.col('end_rn')-F.col('rn')))

    # calculate linear interpolation value
    lin_interp_func = (F.col('start_val')+(F.col('end_val')-F.col('start_val'))/F.col('diff_rn')*F.col('curr_rn'))
    new_df = new_df.withColumn(value_col,F.when(F.col(value_col).isNull(),lin_interp_func).otherwise(F.col(value_col)))

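    # id_cols may be a single column name; wrap it in a list before concatenating
    if not isinstance(id_cols, list):
        id_cols = [id_cols]
    keep_cols = id_cols + [order_col,value_col]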
    new_df = new_df.select(keep_cols)
    return new_df

Finally:

new_df = fill_linear_interpolation(df=df,id_cols='Group',order_col='TS',value_col='Value')
new_df.show()
#+-----+-------------------+-----+
#|Group|                 TS|Value|
#+-----+-------------------+-----+
#|    A|2018-01-01 00:00:00|  1.0|
#|    A|2018-01-02 00:00:00|  2.0|
#|    A|2018-01-03 00:00:00|  3.0|
#|    A|2018-01-04 00:00:00|  4.0|
#|    A|2018-01-05 00:00:00|  5.0|
#|    A|2018-01-06 00:00:00|  7.5|
#|    A|2018-01-07 00:00:00| 10.0|
#|    A|2018-01-08 00:00:00| 11.0|
#+-----+-------------------+-----+
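One caveat: the function interpolates by row number, so it implicitly assumes the rows in each group are evenly spaced in time, as the daily rows in this example are. With irregular timestamps, a time-weighted interpolation gives different values. The plain-Python sketch below illustrates the difference; `interpolate_by_time` is a hypothetical helper, not part of the answer's code, and it assumes the rows are sorted by timestamp. In PySpark, the same idea would presumably mean using the order column cast to seconds in place of the `rn` columns, but that variant is untested here.

```python
from datetime import datetime


def interpolate_by_time(rows):
    """rows: list of (timestamp, value) sorted by timestamp; value may be None.

    Fills None values linearly, weighted by elapsed time between the
    nearest non-null neighbours (illustrative sketch only).
    """
    known = [(t, v) for t, v in rows if v is not None]
    out = []
    for t, v in rows:
        if v is not None:
            out.append((t, v))
            continue
        before = [(kt, kv) for kt, kv in known if kt < t]
        after = [(kt, kv) for kt, kv in known if kt > t]
        if not before or not after:
            out.append((t, None))  # gap at the start or end: left unfilled
            continue
        (t0, v0), (t1, v1) = before[-1], after[0]
        # fraction of the elapsed time between the two known points
        frac = (t - t0).total_seconds() / (t1 - t0).total_seconds()
        out.append((t, v0 + (v1 - v0) * frac))
    return out


rows = [
    (datetime(2018, 1, 1), 0.0),
    (datetime(2018, 1, 2), None),   # one day after the first point
    (datetime(2018, 1, 5), 8.0),    # three days later
]
result = interpolate_by_time(rows)
print(result[1][1])  # 2.0 (a row-number based fill would give 4.0)
```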
