How do you create merge_asof functionality in PySpark?

Backend · Unresolved · 2 answers · 1859 views
北恋 2021-02-14 07:15

Table A has many columns, including a date column; Table B has a datetime and a value. The data in both tables is generated sporadically, with no regular interval.
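For reference, this is the pandas behavior being replicated: `merge_asof` matches each row of the left frame to the nearest key in the right frame. A minimal sketch with made-up sample data:

```python
import pandas as pd

# Hypothetical frames: left has event times, right has sporadic readings.
left = pd.DataFrame({"time": pd.to_datetime(["2021-01-01 00:05", "2021-01-01 00:20"]),
                     "event": ["a", "b"]})
right = pd.DataFrame({"time": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:15"]),
                      "value": [1.0, 2.0]})

# Both frames must be sorted on the join key.
joined = pd.merge_asof(left, right, on="time")  # default direction='backward'
print(joined["value"].tolist())  # [1.0, 2.0]
```

Each event picks up the most recent reading at or before its own timestamp; the question is how to get this behavior on Spark DataFrames.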

2 Answers
  • 2021-02-14 07:23

    Figured out a fast (but perhaps not the most efficient) method to complete this. I built a helper function:

    def get_close_record(df, key_column, datetime_column, record_time):
        """
        Takes an ordered dataframe and returns the key of the first
        record whose datetime is at or after the given record_time.
        Assumes df is sorted ascending on datetime_column and that
        at least one such record exists.
        """
        filtered_df = df[df[datetime_column] >= record_time][0:1]
        [key] = filtered_df[key_column].values.tolist()
        return key
    

    Instead of joining B to A, I wrapped the code above in a pandas_udf and ran it over the columns of table B, then grouped B by the primary key A_key and aggregated B_key with max.

    The issue with this method is that it requires monotonically increasing keys in B.
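    To make the helper concrete, here is how it behaves on a toy frame (the `B_key`/`dt` names and values are made up for illustration):

    ```python
    import pandas as pd

    def get_close_record(df, key_column, datetime_column, record_time):
        """Return the key of the first record at or after record_time.
        Assumes df is sorted ascending on datetime_column."""
        filtered_df = df[df[datetime_column] >= record_time][0:1]
        [key] = filtered_df[key_column].values.tolist()
        return key

    b = pd.DataFrame({"B_key": [1, 2, 3],
                      "dt": pd.to_datetime(["2021-01-01", "2021-01-05", "2021-01-09"])})
    key = get_close_record(b, "B_key", "dt", pd.Timestamp("2021-01-03"))
    print(key)  # 2
    ```

    Note the monotonic-key requirement: the later max aggregation only picks the right match because a larger B_key always means a later datetime.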

    Better solution:

    I developed the following approach, which should work:

    import pandas as pd
    from pyspark.sql import functions as F

    # duplicate the join column under the name '_0' -- the Series a
    # pandas_udf receives is named '_0' (Spark 2.x), so both sides of
    # the merge then share the key name
    other_df['_0'] = other_df['Datetime']
    bdf = sc.broadcast(other_df)

    # merge_asof wrapped in a pandas UDF; both inputs must be sorted on '_0'
    @F.pandas_udf('long')
    def join_asof(v, other=bdf.value):
        f = pd.DataFrame(v)
        j = pd.merge_asof(f, other, on='_0', direction='forward')
        return j['Key']

    joined = df.withColumn('Key', join_asof(F.col('Datetime')))
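    The per-batch logic the pandas_udf runs can be checked in plain pandas. The lookup table below is a hypothetical stand-in for the broadcast `bdf.value`, and the batch column is named `'_0'` explicitly here since there is no Spark runtime to name it:

    ```python
    import pandas as pd

    # Stand-in for bdf.value: the broadcast frame with the join column '_0'.
    other_df = pd.DataFrame({"Key": [10, 20, 30],
                             "Datetime": pd.to_datetime(["2021-01-02", "2021-01-06", "2021-01-10"])})
    other_df["_0"] = other_df["Datetime"]

    def join_asof_batch(v, other=other_df):
        # v plays the role of the Series of Datetime values Spark passes in.
        f = pd.DataFrame({"_0": v})
        j = pd.merge_asof(f, other[["_0", "Key"]], on="_0", direction="forward")
        return j["Key"]

    batch = pd.Series(pd.to_datetime(["2021-01-01", "2021-01-07"]))
    print(join_asof_batch(batch).tolist())  # [10, 30]
    ```

    With `direction='forward'`, each datetime picks the first Key at or after it, matching the helper-function semantics above.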
    
  • 2021-02-14 07:43

    I doubt that it is faster, but you can solve this in Spark using union and last together with a window function.

    from pyspark.sql import functions as f
    from pyspark.sql.window import Window

    # add the missing columns so the two frames can be unioned
    df1 = df1.withColumn('Key', f.lit(None))
    df2 = df2.withColumn('Column1', f.lit(None))

    df3 = df1.unionByName(df2)

    # for each row, take the last non-null Key among all earlier rows
    w = Window.orderBy('Datetime', 'Column1').rowsBetween(Window.unboundedPreceding, -1)
    df3.withColumn('Key', f.last('Key', True).over(w)).filter(~f.isnull('Column1')).show()
    

    Which gives

    +-------+----------+---+
    |Column1|  Datetime|Key|
    +-------+----------+---+
    |      A|2019-02-03|  2|
    |      B|2019-03-14|  4|
    +-------+----------+---+
    

    It's an old question but maybe still useful for somebody.
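    The same union-and-last-fill trick can be checked in pure pandas. The `df1`/`df2` contents below are invented to reproduce the output shown above:

    ```python
    import pandas as pd

    # Hypothetical inputs consistent with the shown result.
    df1 = pd.DataFrame({"Key": [1, 2, 3, 4],
                        "Datetime": pd.to_datetime(["2019-01-10", "2019-02-01",
                                                    "2019-02-28", "2019-03-10"])})
    df2 = pd.DataFrame({"Column1": ["A", "B"],
                        "Datetime": pd.to_datetime(["2019-02-03", "2019-03-14"])})

    # Union the frames, sort by Datetime (key rows sort first on ties,
    # mirroring the Window.orderBy with nulls first), forward-fill the
    # last seen Key, then keep only the event rows.
    u = pd.concat([df1.assign(Column1=None), df2.assign(Key=None)])
    u = u.sort_values(["Datetime", "Column1"], na_position="first")
    u["Key"] = u["Key"].ffill()
    result = u[u["Column1"].notna()][["Column1", "Datetime", "Key"]]
    print(result["Key"].astype(int).tolist())  # [2, 4]
    ```

    A: 2019-02-03 falls after Key 2 (2019-02-01) and before Key 3, and B: 2019-03-14 falls after Key 4, matching the table above.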
