Spark streaming data sharing between batches

隐瞒了意图╮ 2021-02-04 17:56

Spark Streaming processes data in micro-batches.

Each interval's data is processed in parallel using RDDs, without any data sharing between intervals.

But is there a way to share data between batches?

1 Answer
  • 2021-02-04 18:11

    This is possible by "remembering" the last RDD received and using a left join to merge that data with the next streaming batch. We use streamingContext.remember so that the RDDs produced by the streaming process are kept for as long as we need them.

    We also make use of the fact that dstream.transform is an operation that executes on the driver, so we have access to all local object definitions. In particular, we want to update the mutable reference to the last RDD with the required value on each batch.

    A piece of code probably makes the idea clearer:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Seconds

    // assumes ssc (StreamingContext), sparkContext and dstream are already defined

    // configure the streaming context to remember the RDDs produced;
    // choose at least 2x the time of the streaming interval
    ssc.remember(Seconds(xx))

    // initialize "currentData" with an empty RDD of the expected type
    var currentData: RDD[(String, Int)] = sparkContext.emptyRDD

    // classic word count
    val w1dstream = dstream.map(elem => (elem, 1))
    val count = w1dstream.reduceByKey(_ + _)

    // Here's the key to make this work: note how we update the value of the
    // last RDD after using it.
    val diffCount = count.transform { rdd =>
      val interestingKeys = Set("hadoop", "spark")
      val interesting = rdd.filter { case (k, v) => interestingKeys(k) }
      val countDiff = rdd.leftOuterJoin(currentData)
                         .map { case (k, (v1, v2)) => (k, v1 - v2.getOrElse(0)) }
      currentData = interesting
      countDiff
    }

    diffCount.print()
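
    For reference, here is a minimal sketch of the driver setup the snippet above assumes; the app name, 30-second batch interval, and socket source are placeholder choices, not part of the original answer:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // hypothetical setup: app name, batch interval and socket source are placeholders
    val conf = new SparkConf().setAppName("diff-count-example")
    val ssc = new StreamingContext(conf, Seconds(30))
    val sparkContext = ssc.sparkContext

    // keep RDDs around for at least 2x the 30s batch interval, as above
    ssc.remember(Seconds(60))

    // a text stream of words, one per line, from a local socket
    val dstream = ssc.socketTextStream("localhost", 9999)

    // ... word count and the transform from the snippet above go here ...

    ssc.start()
    ssc.awaitTermination()

    The key requirement is that ssc.remember covers at least two batch intervals, so the previous batch's RDD is still available when the next batch's transform runs.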
    