Spark: How to join RDDs by time range

挽巷 2021-02-05 13:27

I have a delicate Spark problem that I just can't wrap my head around.

We have two RDDs (coming from Cassandra). RDD1 contains Actions and RDD2 contains Historic data. Both can be matched by id, and for every Action we need the Historic value that was in effect at the Action's time.

3 Answers
  • 2021-02-05 14:03

    I know that this question has already been answered, but I want to add another solution that worked for me:

    Your data:

    Actions 
    id  |  time  | valueX
    1   |  12:05 | 500
    1   |  12:30 | 500
    2   |  12:30 | 125
    
    Historic 
    id  |  set_at| valueY
    1   |  11:00 | 400
    1   |  12:15 | 450
    2   |  12:20 | 50
    2   |  12:25 | 75
    
    1. Union Actions and Historic
        Combined
        id  |  time  | valueX | record-type
        1   |  12:05 | 500    | Action
        1   |  12:30 | 500    | Action
        2   |  12:30 | 125    | Action
        1   |  11:00 | 400    | Historic
        1   |  12:15 | 450    | Historic
        2   |  12:20 | 50     | Historic
        2   |  12:25 | 75     | Historic
    
    2. Write a custom partitioner and use repartitionAndSortWithinPartitions to partition by id, but sort by time.

      Partition-1
      1   |  11:00 | 400    | Historic
      1   |  12:05 | 500    | Action
      1   |  12:15 | 450    | Historic
      1   |  12:30 | 500    | Action
      Partition-2
      2   |  12:20 | 50     | Historic
      2   |  12:25 | 75     | Historic
      2   |  12:30 | 125    | Action
      

    3. Traverse the records in each partition in order (a sketch follows these steps).

    If it is a Historic record, add its value to a per-partition map keyed by id, or update the entry if the id already exists - the map thus always holds the latest valueY per id.

    If it is an Action record, look up the id's valueY in the map and subtract it from valueX.

    Using a map M per partition, the traversal looks like this:

      Partition-1 traversal in order
      M = {1 -> 400}   // a new entry in map M
      1 | 100          // M(1) = 400; 500 - 400
      M = {1 -> 450}   // update M, because the key already exists
      1 | 50           // M(1) = 450; 500 - 450

      Partition-2 traversal in order
      M = {2 -> 50}    // a new entry in M
      M = {2 -> 75}    // update M, because the key already exists
      2 | 50           // M(2) = 75; 125 - 75
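
    Putting the three steps together, here is a minimal sketch (assuming actions: RDD[Action] and historic: RDD[Historic] with the fields shown above; the partitioner name and partition count are illustrative):

      import org.apache.spark.Partitioner
      import org.apache.spark.rdd.RDD
      import scala.collection.mutable

      // Step 1: tag each record with its type and union into one RDD,
      // keyed by (id, time) so records can be sorted by time within an id
      val combined: RDD[((Int, Long), (Int, String))] =
        actions.map(a => ((a.id, a.time), (a.valueX, "Action")))
          .union(historic.map(h => ((h.id, h.set_at), (h.valueY, "Historic"))))

      // Step 2: partition by id only; the (id, time) key ordering then
      // sorts each id's records by time within its partition
      class IdPartitioner(override val numPartitions: Int) extends Partitioner {
        override def getPartition(key: Any): Int = key match {
          case (id: Int, _) => ((id % numPartitions) + numPartitions) % numPartitions
        }
      }

      val partitioned = combined.repartitionAndSortWithinPartitions(new IdPartitioner(8))

      // Step 3: traverse each partition, tracking the latest Historic valueY
      // per id and emitting (id, time, valueX - valueY) for every Action
      val report = partitioned.mapPartitions { iter =>
        val latest = mutable.Map.empty[Int, Int]
        iter.flatMap {
          case ((id, _), (y, "Historic")) =>
            latest(id) = y // remember the most recent valueY for this id
            None
          case ((id, time), (x, "Action")) =>
            latest.get(id).map(y => (id, time, x - y))
        }
      }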

    You could instead partition and sort by time, but then you would need to merge the partitions afterwards, which adds complexity.

    I found this preferable to the many-to-many join that we usually get when joining on time ranges.

  • 2021-02-05 14:06

    It's an interesting problem. I also spent some time figuring out an approach. This is what I came up with:

    Given case classes for Action(id, time, x) and Historic(id, time, y)
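
    For reference, here is a hypothetical reconstruction of that setup with the sample data from the question (times converted to seconds, e.g. 12:05 -> 43500):

    case class Action(id: Int, time: Long, x: Int)
    case class Historic(id: Int, time: Long, y: Int)

    val actions = sc.parallelize(Seq(
      Action(1, 43500, 500), Action(1, 45000, 500), Action(2, 45000, 125)))
    val historic = sc.parallelize(Seq(
      Historic(1, 39600, 400), Historic(1, 44100, 450),
      Historic(2, 44400, 50), Historic(2, 44700, 75)))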

    • Join the actions with the history (this might be heavy)
    • Filter out all historic data that is not relevant for a given action (i.e. set after the action's time)
    • Key the results by (id, time) to differentiate the same id at different times
    • Reduce each key to the historic record with the latest timestamp, leaving the single relevant historical record for the given action

    In Spark:

    val actionById = actions.keyBy(_.id)
    val historyById = historic.keyBy(_.id)
    val actionByHistory = actionById.join(historyById)
    // keep only historic records set before the action's time
    val filteredActionByIdTime = actionByHistory.collect {
      case (id, (action, historic)) if action.time > historic.time =>
        ((action.id, action.time), (action, historic))
    }
    // per (id, time), keep the historic record with the latest timestamp
    val topHistoricByAction = filteredActionByIdTime.reduceByKey {
      case ((a1, h1), (a2, h2)) => (a1, if (h1.time > h2.time) h1 else h2)
    }

    // we are done, let's produce a report now
    val report = topHistoricByAction.map {
      case ((id, time), (action, historic)) => (id, time, action.x - historic.y)
    }
    

    Using the data provided above, the report looks like:

    report.collect
    Array[(Int, Long, Int)] = Array((1,43500,100), (1,45000,50), (2,45000,50))
    

    (I converted the times to seconds to keep the timestamps simple.)

  • 2021-02-05 14:13

    After a few hours of thinking, trying, and failing, I came up with this solution. I am not sure if it is any good, but for lack of other options, this is it.

    First we extend our case class Historic:

    case class Historic(id: String, set_at: Long, valueY: Int) {
      // Scala doesn't seem to provide a sorted map with the floor/lower
      // operations we need below, so we use Java's TreeMap
      val set_at_map = new java.util.TreeMap[Long, Int]()
      set_at_map.put(0, valueY)      // valid from the beginning of the epoch ...
      set_at_map.put(set_at, valueY) // ... up to the set_at date

      // This is the fun part: getHistoricValue takes any timestamp and returns
      // the value whose key range contains that date. For more information see
      // this answer: http://stackoverflow.com/a/13400317/1209327
      def getHistoricValue(date: Long): Option[Int] = {
        var e = set_at_map.floorEntry(date)
        if (e != null && e.getValue == null) {
          e = set_at_map.lowerEntry(date)
        }
        if (e == null) None else Some(e.getValue)
      }
    }
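
    For example, a quick check of the lookup semantics with the data for id 1 (timestamps in seconds, as in the other answer):

    val h = Historic("1", 39600L, 400) // set at 11:00 (39600s)
    h.set_at_map.put(44100L, 450)      // second event at 12:15 (44100s)

    h.getHistoricValue(43500L)         // Some(400): 12:05 falls between 11:00 and 12:15
    h.getHistoricValue(45000L)         // Some(450): 12:30 is after 12:15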
    

    The case class is ready, and now we bring it into action:

    val historicRDD = sc.cassandraTable[Historic](...)
      .map( row => ( row.id, row ) )
      .reduceByKey( (row1, row2) => {
        // merge all historic events of an id into row1's TreeMap
        row1.set_at_map.put(row2.set_at, row2.valueY)
        row1
      })

    // Now we load the Actions and key them by id, as we did with Historic
    val actionsRDD = sc.cassandraTable[Actions](...)
      .map( row => ( row.id, row ) )

    // Both RDDs now share the same key, so we can join them
    val fin = actionsRDD.join(historicRDD)
      .map { case (id, (action, historic)) =>
        ( id,
          (
            action.id,
            // getHistoricValue returns the valueY in effect at that timestamp
            action.valueX - historic.getHistoricValue(action.time).get
          )
        )
      }
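
    With the sample data from the question, fin.collect would then produce something like:

    // Array(("1",("1",100)), ("1",("1",50)), ("2",("2",50)))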
    

    I am totally new to Scala, so please let me know if this code can be improved anywhere.
