What’s a performant way to do fuzzy joins in PySpark?
I am looking for the community's views on a scalable approach to joining large Spark DataFrames on a nearest-key condition.
What you are looking for is a temporal join. Check out Flint, Two Sigma's time series library for Spark (formerly called HuoHua, "spark" in Chinese): https://github.com/twosigma/flint
Using this library, given two TimeSeriesDataFrames (the documentation explains these objects), you can perform the join in PySpark (or Scala Spark):
ddf_event = ...  # TimeSeriesDataFrame of events (one way to build it is sketched below)
ddf_gps = ...    # TimeSeriesDataFrame of GPS fixes
# For each event row, attach the closest GPS row at or before its timestamp, within the tolerance window
result = ddf_event.leftJoin(ddf_gps, tolerance="1day")
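For completeness, here is a minimal sketch of how those TimeSeriesDataFrames might be built from ordinary Spark DataFrames with the ts-flint Python bindings; the input DataFrame names and the assumption that the timestamp column is called "time" are mine, not from your question, so check the Flint docs for the exact reader options:

from ts.flint import FlintContext

# Wrap an existing SQLContext / SparkSession
flint_context = FlintContext(sqlContext)

# Convert plain Spark DataFrames (assumed to have a 'time' timestamp column)
# into TimeSeriesDataFrames; isSorted=False lets Flint sort them by time
ddf_event = flint_context.read.option('isSorted', False).dataframe(event_df)
ddf_gps = flint_context.read.option('isSorted', False).dataframe(gps_df)

From there the leftJoin call above applies directly.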
Your timestamp granularity was not clear from the question, so set the tolerance according to your needs. You can also do "future joins" (matching the nearest row after each timestamp instead of before it) if needed; see the sketch below.
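For example, if each event should pick up the closest GPS fix that comes after it rather than before, a future-looking variant would look roughly like this (same assumed DataFrames and tolerance as above):

# Match each event to the closest GPS row at or after the event's timestamp
result_future = ddf_event.futureLeftJoin(ddf_gps, tolerance="1day")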
Check out their Spark Summit presentation for more explanation and examples: https://youtu.be/g8o5-2lLcvQ