What’s a performant way to do fuzzy joins in PySpark?
I am looking for the community's views on a scalable approach to joining large Spark DataFrames on a nearest-key condition.
What you are looking for is a temporal join. Check out Flint, Two Sigma's time series library for Spark (formerly called HuoHua, "spark" in Chinese): https://github.com/twosigma/flint
Using this library, given two TimeSeriesDataFrames (the documentation explains these objects), you can perform the join in PySpark (or Scala Spark):
ddf_event = ...  # TimeSeriesDataFrame of events (one way to build it is sketched below)
ddf_gps = ...    # TimeSeriesDataFrame of GPS fixes
# For each event row, attach the closest GPS row at or before its timestamp, within the tolerance window
result = ddf_event.leftJoin(ddf_gps, tolerance="1day")
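For completeness, here is a minimal sketch of how those TimeSeriesDataFrames might be built from ordinary Spark DataFrames with the ts-flint Python bindings; the input DataFrame names and the assumption that the timestamp column is called "time" are mine, not from your question, so check the Flint docs for the exact reader options:

from ts.flint import FlintContext

# Wrap an existing SQLContext / SparkSession
flint_context = FlintContext(sqlContext)

# Convert plain Spark DataFrames (assumed to have a 'time' timestamp column)
# into TimeSeriesDataFrames; isSorted=False lets Flint sort them by time
ddf_event = flint_context.read.option('isSorted', False).dataframe(event_df)
ddf_gps = flint_context.read.option('isSorted', False).dataframe(gps_df)

From there the leftJoin call above applies directly.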
Your timestamp granularity was not clear from the question, so set the tolerance according to your needs. You can also do "future joins" (matching the nearest row after each timestamp instead of before it) if needed; see the sketch below.
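For example, if each event should pick up the closest GPS fix that comes after it rather than before, a future-looking variant would look roughly like this (same assumed DataFrames and tolerance as above):

# Match each event to the closest GPS row at or after the event's timestamp
result_future = ddf_event.futureLeftJoin(ddf_gps, tolerance="1day")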
Check out their Spark Summit presentation for more explanation and examples: https://youtu.be/g8o5-2lLcvQ