Joining Spark DataFrames on a nearest key condition

慢半拍i 2021-02-04 16:46

What’s a performant way to do fuzzy joins in PySpark?

I am looking for the community's views on a scalable approach to joining large Spark DataFrames on a nearest key condition.

1 Answer
  • 2021-02-04 17:21

    What you are looking for is a temporal join. Check out Flint, the time series library for Spark (formerly HuoHua, 'Spark' in Chinese): https://github.com/twosigma/flint

    Using this library, given two TimeSeriesDataFrames (the documentation explains these objects), you can perform the join in PySpark (or Scala Spark) like this:

    ddf_event = ...  # Flint TimeSeriesDataFrame of events, sorted by time
    ddf_gps = ...    # Flint TimeSeriesDataFrame of GPS fixes, sorted by time
    # For each event row, attach the closest GPS row at or before its timestamp,
    # looking back at most one day
    result = ddf_event.leftJoin(ddf_gps, tolerance="1day")
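
    The ... placeholders stand for Flint TimeSeriesDataFrames built from ordinary Spark DataFrames. A rough, hedged sketch of that setup (assuming the ts.flint Python bindings as shown in the project README, a timestamp column named 'time', and made-up input paths):

    from pyspark.sql import SparkSession, SQLContext
    from ts.flint import FlintContext

    spark = SparkSession.builder.getOrCreate()
    flint_context = FlintContext(SQLContext(spark.sparkContext))

    # Ordinary Spark DataFrames, each with a 'time' timestamp column (paths are hypothetical)
    event_df = spark.read.parquet("events.parquet")
    gps_df = spark.read.parquet("gps.parquet")

    # Wrap them as Flint TimeSeriesDataFrames, keyed and sorted on 'time'
    ddf_event = flint_context.read.dataframe(event_df)
    ddf_gps = flint_context.read.dataframe(gps_df)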
    

    Your timestamps were not clear in the question, so set tolerance according to your needs. You can also do 'future joins' (matching the nearest row at or after each timestamp) if needed; see the sketch below.
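
    As a hedged sketch of such a future join (Flint's README describes futureLeftJoin as the forward-looking counterpart of leftJoin; the tolerance value here is illustrative):

    # For each event row, attach the closest GPS row at or *after* its timestamp,
    # looking forward at most one day
    result_future = ddf_event.futureLeftJoin(ddf_gps, tolerance="1day")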

    Check out their Spark Summit presentation for more explanation and examples: https://youtu.be/g8o5-2lLcvQ
