I have a delicate Spark problem that I just can't wrap my head around.
We have two RDDs (coming from Cassandra). RDD1 contains Actions
and RDD2 contai
I know that this question has been answered, but I want to add another solution that worked for me.
Your data:
Actions
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
Combine Actions and Historic into one data set, tagging each record with its type:
Combined
id | time | valueX | record-type
1 | 12:05 | 500 | Action
1 | 12:30 | 500 | Action
2 | 12:30 | 125 | Action
1 | 11:00 | 400 | Historic
1 | 12:15 | 450 | Historic
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
Write a custom partitioner and use repartitionAndSortWithinPartitions to partition by id, but sort by time.
Partition-1
1 | 11:00 | 400 | Historic
1 | 12:05 | 500 | Action
1 | 12:15 | 450 | Historic
1 | 12:30 | 500 | Action
Partition-2
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
2 | 12:30 | 125 | Action
Traverse the records of each partition in order.
If it is a Historic record, add it to a map, or update the map if it already has that id. This keeps track of the latest valueY per id, using one map per partition.
If it is an Action record, get the valueY value from the map and subtract it from valueX.
A map M
Partition-1 traversal in order
M={ 1 -> 400} // A new entry in map M
1 | 100 // M(1) = 400; 500-400
M={1 -> 450} // update M, because key already exists
1 | 50 // M(1) = 450; 500-450
Partition-2 traversal in order
M={ 2 -> 50} // A new entry in M
M={ 2 -> 75} // update M, because key already exists
2 | 50 // M(2) = 75; 125-75
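The traversal above can be sketched in plain Scala collections, without Spark. This is only an illustration under assumptions: `Record`, `isAction`, and `SortedTraversal` are hypothetical names, times are converted to seconds since midnight, and `groupBy` followed by `sortBy` stands in for what repartitionAndSortWithinPartitions would do on a cluster. An Action with no earlier Historic record is dropped here, which is one possible choice.

```scala
// Hypothetical record type mirroring the rows of the Combined table.
final case class Record(id: Int, time: Int, value: Int, isAction: Boolean)

object SortedTraversal {
  // Times are seconds since midnight (e.g. 12:05 -> 43500).
  val combined: Seq[Record] = Seq(
    Record(1, 43500, 500, isAction = true),
    Record(1, 45000, 500, isAction = true),
    Record(2, 45000, 125, isAction = true),
    Record(1, 39600, 400, isAction = false),
    Record(1, 44100, 450, isAction = false),
    Record(2, 44400, 50, isAction = false),
    Record(2, 44700, 75, isAction = false)
  )

  // Walk one partition that is already sorted by time. The map keeps the
  // latest Historic value per id; each Action emits (id, valueX - valueY).
  def traverse(partition: Seq[Record]): Seq[(Int, Int)] = {
    val latest = scala.collection.mutable.Map.empty[Int, Int]
    partition.flatMap { r =>
      if (r.isAction) latest.get(r.id).map(y => (r.id, r.value - y)).toList
      else { latest(r.id) = r.value; Nil }
    }
  }

  // Stand-in for partitioning by id and sorting by time within each partition.
  def result: Seq[(Int, Int)] =
    combined.groupBy(_.id).toSeq.sortBy(_._1).flatMap {
      case (_, records) => traverse(records.sortBy(_.time))
    }
}
```

With the sample data, `SortedTraversal.result` yields `(1,100)`, `(1,50)`, `(2,50)`, matching the trace above.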
You could try to partition and sort by time, but then you need to merge the partitions later, and that adds some complexity.
I found this preferable to the many-to-many join that we usually get when joining on time ranges.
It's an interesting problem. I also spent some time figuring out an approach. This is what I came up with:
Given case classes for Action(id, time, x) and Historic(id, time, y), in Spark:
val actionById = actions.keyBy(_.id)
val historyById = historic.keyBy(_.id)
val actionByHistory = actionById.join(historyById)
// keep only the historic entries that were set before the action happened
val filteredActionByidTime = actionByHistory.collect{ case (k,(action,historic)) if (action.time>historic.time) => ((action.id, action.time),(action,historic))}
// per action, keep the most recent of those historic entries
val topHistoricByAction = filteredActionByidTime.reduceByKey{ case ((a1:Action,h1:Historic),(a2:Action, h2:Historic)) => (a1, if (h1.time>h2.time) h1 else h2)}
// we are done, let's produce a report now
val report = topHistoricByAction.map{case ((id,time),(action,historic)) => (id,time,action.x - historic.y)}
Using the data provided above, the report looks like:
report.collect
Array[(Int, Long, Int)] = Array((1,43500,100), (1,45000,50), (2,45000,50))
(I transformed the time to seconds to have a simplistic timestamp)
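The same join, filter, and reduce logic can be tried on plain Scala collections before running it on a cluster. This is a sketch under assumptions: the field names follow the case classes described above, `JoinReport` is a hypothetical name, and `groupBy` plus `maxBy` stands in for `reduceByKey`.

```scala
final case class Action(id: Int, time: Long, x: Int)
final case class Historic(id: Int, time: Long, y: Int)

object JoinReport {
  // Sample data from the question, with times as seconds since midnight.
  val actions: Seq[Action] = Seq(
    Action(1, 43500L, 500), Action(1, 45000L, 500), Action(2, 45000L, 125)
  )
  val historic: Seq[Historic] = Seq(
    Historic(1, 39600L, 400), Historic(1, 44100L, 450),
    Historic(2, 44400L, 50), Historic(2, 44700L, 75)
  )

  def report: Seq[(Int, Long, Int)] = {
    // Join by id and keep only Historic entries set before the Action.
    val joined = for {
      a <- actions
      h <- historic
      if a.id == h.id && a.time > h.time
    } yield ((a.id, a.time), (a, h))

    // Per (id, time) key, keep the most recent Historic entry.
    joined.groupBy(_._1).toSeq.map { case ((id, time), pairs) =>
      val (a, h) = pairs.map(_._2).maxBy(_._2.time)
      (id, time, a.x - h.y)
    }.sortBy(t => (t._1, t._2))
  }
}
```

Note that the intermediate join pairs every Action with every earlier Historic entry of the same id, so its size grows with the length of each id's history.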
After a few hours of thinking, trying, and failing, I came up with this solution. I am not sure if it is any good, but due to the lack of other options, this is my solution.
First we expand our case class Historic
case class Historic(id: String, set_at: Long, valueY: Int) {
  // java.util.TreeMap offers floorEntry, which the lookup below relies on
  val set_at_map = new java.util.TreeMap[Long, Int]()
  set_at_map.put(0L, valueY) // from the beginning of the epoch ...
  set_at_map.put(set_at, valueY) // ... to the set_at date
  // This is the fun part. getHistoricValue takes any timestamp and returns the value stored under the greatest key that is less than or equal to it. For more information look at this answer: http://stackoverflow.com/a/13400317/1209327
  def getHistoricValue(date: Long): Option[Int] = {
    // floorEntry returns null when no key is <= date, so wrap the result in Option
    Option(set_at_map.floorEntry(date)).map(_.getValue)
  }
}
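To see the lookup in isolation, here is a minimal sketch of the `floorEntry` idea on a standalone `java.util.TreeMap`. `HistoricLookup`, `timeline`, and `valueAt` are hypothetical helper names, not part of the answer's code. Note that `floorEntry` is inclusive, so a value set at exactly the queried time is the one returned.

```scala
import java.util.TreeMap

object HistoricLookup {
  // Build a timeline of (set_at -> valueY) entries for a single id.
  def timeline(entries: (Long, Int)*): TreeMap[Long, Int] = {
    val m = new TreeMap[Long, Int]()
    entries.foreach { case (t, y) => m.put(t, y) }
    m
  }

  // floorEntry returns the entry with the greatest key <= date,
  // or null if there is no such key; Option(...) turns null into None.
  def valueAt(m: TreeMap[Long, Int], date: Long): Option[Int] =
    Option(m.floorEntry(date)).map(_.getValue)
}
```

For the history of id 1 (400 set at 11:00, 450 set at 12:15), a lookup at 12:05 gives `Some(400)` and a lookup at 12:30 gives `Some(450)`; a lookup before the first entry gives `None`.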
The case class is ready and now we bring it into action
val historicRDD = sc.cassandraTable[Historic](...)
.map( row => ( row.id, row ) )
.reduceByKey( (row1, row2) => {
row1.set_at_map.put(row2.set_at, row2.valueY) // we add the historic Events up to each id
row1
})
// Now we load the Actions and map it by id as we did with Historic
val actionsRDD = sc.cassandraTable[Actions](...)
.map( row => ( row.id, row ) )
// Now both RDDs have the same key and we can join them
val fin = actionsRDD.join(historicRDD)
  .map { case (id, (action, historic)) =>
    // the key is already the id; getHistoricValue returns the valueY valid at the action's timestamp
    (id, (action.id, action.valueX - historic.getHistoricValue(action.time).get))
  }
I am totally new to Scala, so please let me know if this code could be improved anywhere.