Table A has many columns, including a date column. Table B has a datetime and a value. The data in both tables are generated sporadically with no regular interval.
I figured out a fast (but perhaps not the most efficient) way to do this. First, I built a helper function:
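def get_close_record(df, key_column, datetime_column, record_time):
    """
    Takes in a dataframe ordered by datetime_column and returns the key
    of the closest record at or after the datetime given.
    """
    # Keep only rows at or after record_time, then take the first one.
    filtered_df = df[df[datetime_column] >= record_time][0:1]
    # Unpacking raises ValueError if no row is at or after record_time.
    [key] = filtered_df[key_column].values.tolist()
    return key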
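As a rough illustration of the same lookup, here is a minimal sketch that uses a binary search instead of a boolean filter. The function name `get_close_record_sorted`, the sample table, and the choice to return `None` when nothing matches are my assumptions, not part of the original helper; it relies on the dataframe already being sorted by the datetime column.

```python
import pandas as pd

def get_close_record_sorted(df, key_column, datetime_column, record_time):
    """Return the key of the first row at or after record_time, or None.

    Assumes df is already sorted ascending by datetime_column.
    """
    # side="left" finds the first position whose value is >= record_time.
    pos = df[datetime_column].searchsorted(record_time, side="left")
    if pos == len(df):
        return None  # no record at or after record_time
    return df[key_column].iloc[pos]

# Hypothetical sample data, sorted by date.
table_a = pd.DataFrame({
    "A_key": [10, 11, 12],
    "Date": pd.to_datetime(["2021-01-01", "2021-01-05", "2021-01-09"]),
})

first = get_close_record_sorted(table_a, "A_key", "Date", pd.Timestamp("2021-01-03"))
exact = get_close_record_sorted(table_a, "A_key", "Date", pd.Timestamp("2021-01-05"))
missing = get_close_record_sorted(table_a, "A_key", "Date", pd.Timestamp("2021-02-01"))
```

Here `first` and `exact` both resolve to key 11, and `missing` is `None` because no row falls on or after the requested time.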
Instead of joining B to A, I set up a pandas_udf of the above code and ran it on the columns of table B, then ran groupBy on B with primary key A_key and aggregated B_key by max.
The issue with this method is that it requires monotonically increasing keys in B.
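To make that limitation concrete, here is a small pandas-only sketch. It assumes the goal of the aggregation is to pick the latest B row for each A_key; the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows of B after the UDF has attached the matching A_key.
# B_key 7 was generated later than B_key 9, so keys are not monotonic in time.
b = pd.DataFrame({
    "B_key": [9, 7, 3],
    "Datetime": pd.to_datetime(["2021-01-02", "2021-01-04", "2021-01-08"]),
    "A_key": [11, 11, 12],
})

# Aggregating by max key picks B_key 9 for A_key 11, which is not the latest row.
by_max_key = b.groupby("A_key")["B_key"].max()

# Selecting the row with the latest Datetime per group does not depend on key order.
latest_idx = b.groupby("A_key")["Datetime"].idxmax()
by_latest_time = b.loc[latest_idx].set_index("A_key")["B_key"]
```

With these rows, `by_max_key` gives 9 for A_key 11 while `by_latest_time` gives 7, so the two only agree when keys grow with time.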
Better solution:
I developed the following helper function, which should work:
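import pandas as pd
from pyspark.sql import functions as F

# merge_asof needs the lookup table sorted on the join column
other_df['_0'] = other_df['Datetime']
other_df = other_df.sort_values('_0')
bdf = sc.broadcast(other_df)

# merge_asof UDF
@F.pandas_udf('long')
def join_asof(v, other=bdf.value):
    # name the incoming Series '_0' so it matches the join column in other
    f = v.to_frame('_0')
    j = pd.merge_asof(f, other, on='_0', direction='forward')
    return j['Key']

joined = df.withColumn('Key', join_asof(F.col('Datetime')))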
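The core of the UDF can be tried in plain pandas, without Spark. The sketch below is an assumption-laden illustration: `asof_forward_keys`, the `pos` helper column, and the sample data are hypothetical. It also sorts the left side first, because `pd.merge_asof` expects both inputs sorted on the join column, and then restores the original row order.

```python
import pandas as pd

def asof_forward_keys(times, other):
    """For each timestamp in times, return the Key of the first row of other
    whose '_0' value is at or after it, in the original order of times."""
    # merge_asof requires both sides sorted on the join column.
    left = pd.DataFrame({
        "_0": times.reset_index(drop=True),
        "pos": range(len(times)),  # remembers the caller's row order
    }).sort_values("_0")
    right = other.sort_values("_0")
    merged = pd.merge_asof(left, right, on="_0", direction="forward")
    # Restore the caller's row order before returning.
    return merged.sort_values("pos")["Key"].reset_index(drop=True)

# Hypothetical lookup table and unsorted input timestamps.
other_df = pd.DataFrame({
    "Key": [10, 11, 12],
    "_0": pd.to_datetime(["2021-01-01", "2021-01-05", "2021-01-09"]),
})
times = pd.Series(pd.to_datetime(
    ["2021-01-06", "2021-01-02", "2021-01-05", "2021-01-10"]
))
keys = asof_forward_keys(times, other_df)
```

The last timestamp has no later row in `other_df`, so its key comes back as NaN. That case is worth handling before returning the result from a UDF declared as 'long'.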