How do you create merge_asof functionality in PySpark?

北恋 2021-02-14 07:15

Table A has many columns with a date column, Table B has a datetime and a value. The data in both tables are generated sporadically with no regular int

2 answers

星月不相逢 (original poster)

2021-02-14 07:43

I doubt that it is faster, but you could solve it with Spark by using union and last together with a window function.

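from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Give both DataFrames the same columns so they can be unioned by name.
df1 = df1.withColumn('Key', f.lit(None))
df2 = df2.withColumn('Column1', f.lit(None))

df3 = df1.unionByName(df2)

# Order by time. On equal timestamps the df2 rows (null Column1) sort first,
# so an exact match counts. The frame covers every row before the current one.
# Note: a window without partitionBy moves all rows into a single partition.
w = Window.orderBy('Datetime', 'Column1').rowsBetween(Window.unboundedPreceding, -1)

# Take the last non-null Key seen so far, then keep only the original df1 rows.
df3.withColumn('Key', f.last('Key', ignorenulls=True).over(w)).filter(f.col('Column1').isNotNull()).show()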

Which gives

+-------+----------+---+
|Column1|  Datetime|Key|
+-------+----------+---+
|      A|2019-02-03|  2|
|      B|2019-03-14|  4|
+-------+----------+---+
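
To see why this works, here is a minimal plain-Python sketch of the same steps (union, sort, carry the last non-null key forward), with no Spark involved. The input rows are hypothetical, chosen only so that the result matches the table above, and `asof_backward` is an illustrative helper name rather than a library function.

```python
from datetime import date

# Hypothetical inputs, picked only so the result matches the table above.
left = [  # plays the role of df1: (Column1, Datetime)
    ("A", date(2019, 2, 3)),
    ("B", date(2019, 3, 14)),
]
right = [  # plays the role of df2: (Datetime, Key)
    (date(2019, 1, 3), 2),
    (date(2019, 3, 6), 3),
    (date(2019, 3, 14), 4),
]


def asof_backward(left, right):
    """For each left row, attach the most recent right key at or before its date."""
    # Union: give every row the same three fields, filling the missing one with None.
    rows = [(d, c, None) for c, d in left] + [(d, None, k) for d, k in right]
    # Sort by date; on ties, right-table rows (Column1 is None) come first,
    # mirroring Spark's default nulls-first ascending order.
    rows.sort(key=lambda r: (r[0], r[1] is not None, r[1] or ""))

    result = []
    last_key = None  # last non-null key seen so far
    for d, column1, key in rows:
        if column1 is not None:
            # A left-table row: it receives the key carried forward from earlier rows.
            result.append((column1, d, last_key))
        if key is not None:
            # A right-table row: it updates the carried key.
            last_key = key
    return result


result = asof_backward(left, right)
for row in result:
    print(row)
```

Because the right-table rows sort first on equal dates, row B picks up key 4 from the same day. A left row with no earlier right row gets None, just as the Spark version leaves Key null.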

It's an old question but maybe still useful for somebody.
