Spark: How to join RDDs by time range

Asked by 挽巷 · 2021-02-05 13:27

I have a delicate Spark problem that I just can't wrap my head around.

We have two RDDs (coming from Cassandra). RDD1 contains Actions and RDD2 contains Historic data. For every action we need to find the latest historic record for the same id whose time precedes the action's time.

3 Answers
  •  悲&欢浪女
    2021-02-05 14:06

    It's an interesting problem. I also spent some time figuring out an approach. This is what I came up with:

    Given case classes for Action(id, time, x) and Historic(id, time, y):

    • Join the actions with the history by id (this might be heavy: it pairs every action with every historic record for that id)
    • Filter out all historic data that is not relevant for a given action, i.e. records dated after the action
    • Key the results by (id, time) to differentiate the same id at different times
    • Reduce the history per action to the record with the maximum time, leaving us with the relevant historic record for each action

    In Spark:

    case class Action(id: Int, time: Long, x: Int)
    case class Historic(id: Int, time: Long, y: Int)

    val actionById = actions.keyBy(_.id)
    val historyById = historic.keyBy(_.id)
    // every action paired with every historic record sharing its id
    val actionByHistory = actionById.join(historyById)
    // keep only historic records that precede the action, keyed by (id, time)
    val filteredActionByIdTime = actionByHistory.collect { case (_, (action, historic)) if action.time > historic.time => ((action.id, action.time), (action, historic)) }
    // per action, keep the historic record with the greatest time
    val topHistoricByAction = filteredActionByIdTime.reduceByKey { case ((a1, h1), (_, h2)) => (a1, if (h1.time > h2.time) h1 else h2) }

    // we are done, let's produce a report now
    val report = topHistoricByAction.map { case ((id, time), (action, historic)) => (id, time, action.x - historic.y) }
    

    Using the data from the question, the report looks like:

    report.collect
    Array[(Int, Long, Int)] = Array((1,43500,100), (1,45000,50), (2,45000,50))
    

    (I converted the times to seconds to keep the timestamps simple.)
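
    If you want to try this end to end, here is a minimal self-contained sketch. The sample rows below are my own illustration, chosen so the pipeline reproduces the report above; they are not necessarily the asker's original data, and sc is assumed to be an existing SparkContext:

    // hypothetical sample data: historic snapshots and later actions per id
    val historic = sc.parallelize(Seq(
      Historic(1, 43200L, 0),   // id 1 at 12:00
      Historic(1, 44000L, 50),  // id 1, a later snapshot
      Historic(2, 43200L, 0)    // id 2 at 12:00
    ))
    val actions = sc.parallelize(Seq(
      Action(1, 43500L, 100),   // id 1 at 12:05 -> latest prior snapshot at 43200 -> 100 - 0 = 100
      Action(1, 45000L, 100),   // id 1 at 12:30 -> latest prior snapshot at 44000 -> 100 - 50 = 50
      Action(2, 45000L, 50)     // id 2 at 12:30 -> latest prior snapshot at 43200 -> 50 - 0 = 50
    ))

    Running report.collect on these inputs yields the array shown above, though the row order may differ since nothing in the pipeline sorts the result.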
