How do you create merge_asof functionality in PySpark?

北恋 2021-02-14 07:15

Table A has many columns with a date column, Table B has a datetime and a value. The data in both tables are generated sporadically with no regular int

2 Answers

爱一瞬间的悲伤 (original poster)

2021-02-14 07:23

I figured out a quick (though perhaps not the most efficient) way to do this. I built a helper function:

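def get_close_record(df, key_column, datetime_column, record_time):
    """
    Takes in a dataframe sorted by datetime_column and returns the key
    of the first record at or after the given datetime.
    """
    # Keep only rows at or after record_time, then take the first one.
    filtered_df = df[df[datetime_column] >= record_time].iloc[0:1]
    # Unpacking raises ValueError if no record qualifies.
    [key] = filtered_df[key_column].values.tolist()
    return key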

Instead of joining B to A, I wrapped the code above in a pandas_udf and ran it on the columns of table B, then ran groupBy on B with primary key A_key and aggregated B_key by max.

The issue with this method is that it requires monotonically increasing keys in B.
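The lookup that get_close_record performs (the first record at or after a given time) can also be sketched without pandas. This is only an illustration of the matching rule, not the code I ran; the function name forward_asof_key and the sample data are made up for the example, and it assumes the times are already sorted ascending.

```python
from bisect import bisect_left
from datetime import datetime


def forward_asof_key(times, keys, record_time):
    """Return the key of the first entry whose time is >= record_time.

    `times` must be sorted ascending and aligned with `keys`.
    Returns None when every entry is earlier than record_time.
    """
    i = bisect_left(times, record_time)
    return keys[i] if i < len(times) else None


# Hypothetical sample data standing in for table B.
times = [datetime(2021, 1, 2), datetime(2021, 1, 3), datetime(2021, 1, 5)]
keys = [10, 20, 30]

before_all = forward_asof_key(times, keys, datetime(2021, 1, 1))
exact_hit = forward_asof_key(times, keys, datetime(2021, 1, 3))
after_all = forward_asof_key(times, keys, datetime(2021, 1, 10))
```

Unlike the helper above, this returns None instead of raising when nothing qualifies.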

Better solution:

I developed the following helper function, which should work:

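import pandas as pd
import pyspark.sql.functions as F

# Assumes sc is the active SparkContext and other_df is a pandas DataFrame.
# merge_asof needs both sides sorted on the join column, here named '_0'.
other_df['_0'] = other_df['Datetime']
bdf = sc.broadcast(other_df.sort_values('_0'))

# merge_asof UDF
@F.pandas_udf('long')
def join_asof(v, other=bdf.value):
    # Sort the batch for merge_asof, then restore the original row order.
    f = pd.DataFrame({'_0': v.to_numpy(), '_pos': range(len(v))})
    j = pd.merge_asof(f.sort_values('_0'), other, on='_0', direction='forward')
    return j.sort_values('_pos')['Key'].reset_index(drop=True)

joined = df.withColumn('Key', join_asof(F.col('Datetime')))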
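To see what direction='forward' does on its own, here is a small pandas-only sketch. The frames a and b are hypothetical stand-ins for tables A and B, with made-up values; only the Datetime and Key column names come from the code above.

```python
import pandas as pd

# Hypothetical stand-ins for tables A and B; both are sorted by the join key.
a = pd.DataFrame(
    {'Datetime': pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-10'])}
)
b = pd.DataFrame({
    'Datetime': pd.to_datetime(['2021-01-02', '2021-01-03', '2021-01-05']),
    'Key': [10, 20, 30],
})

# For each row of a, take the first row of b whose Datetime is at or after it.
result = pd.merge_asof(a, b, on='Datetime', direction='forward')
```

Rows of a with no later record in b get NaN in Key, so the column comes back as float. That is worth keeping in mind when the UDF declares a 'long' return type.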
