How history RDDs are preserved for further use in the given code

前端未结

关注

 1  1284

{
var history: RDD[(String, List[String]) = sc.emptyRDD()

val dstream1 = ...
val dstream2 = ...

val historyDStream = dstream1.transform(rdd => rdd.union(history))
v


                      
              相关标签:


      
      
        
          1条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  感动是毒        
                
              
                            
                2021-01-25 19:11
              
            
            
                                                                       
We can understand how the history builds up in this case by observing how the RDD lineage evolves over time.

We need two pieces of previous knowledge:


RDDs are immutable structures
Operations on RDD can be expressed in functional terms by the function to be applied and references to the input RDDs. 


Let's see a quick example, using the classical wordCount:

val txt = sparkContext.textFile(someFile)
val words = txt.flatMap(_.split(" "))


In simplified terms, txt is a HadoopRDD(someFile). words is a MapPartitionsRDD(txt, flatMapFunction). We speak of the lineage of words as the DAG (Direct Acyclic Graph) that is formed of this chaining of operations.: HadoopRDD <-- MapPartitionsRDD.

We can apply the same principles to our streaming operation:

At iteration 0, we have

var history: RDD[(String, List[String]) = sc.emptyRDD()
// -> history: EmptyRDD
...
val historyDStream = dstream1.transform(rdd => rdd.union(history))
// -> underlying RDD: rdd.union(EmptyRDD)
join, filter
// underlying RDD: ((rdd.union(EmptyRDD).join(otherRDD)).filter(pred)
map
// -> underlying RDD: ((rdd.union(EmptyRDD).join(otherRDD)).filter(pred).map(f) 
history.unpersist(false)
//  EmptyRDD.unpersist (does nothing, it was never persisted)
history = formatted
// history =  ((rdd.union(EmptyRDD).join(otherRDD)).filter(pred).map(f)
history.persist(...)
// history marked for persistence (at the next action)
history.count()
// ((rdd.union(EmptyRDD).join(otherRDD)).filter(pred).map(f).count()
// cache result of:  ((rdd.union(EmptyRDD).join(otherRDD)).filter(pred).map(f)


At iteration 1, we have (adding rdd0, rdd1 as iteration index):

val historyDStream = dstream1.transform(rdd => rdd.union(history))
// -> underlying RDD: rdd1.union(((rdd0.union(EmptyRDD).join(otherRDD0)).filter(pred).map(f))
join, filter
// underlying RDD: ((rdd1.union(((rdd0.union(EmptyRDD).join(otherRDD0)).filter(pred).map(f)).join(otherRDD1)).filter(pred)
map
// -> underlying RDD: ((rdd1.union(((rdd0.union(EmptyRDD).join(otherRDD0)).filter(pred).map(f)).join(otherRDD1)).filter(pred).map(f) 
history.unpersist(false)
//  history0.unpersist (marks the previous result for removal, we used it already for our computation above)
history = formatted
// history1 =  ((rdd1.union(((rdd0.union(EmptyRDD).join(otherRDD0)).filter(pred).map(f)).join(otherRDD1)).filter(pred).map(f)
history.persist(...)
// new history marked for persistence (at the next action)
history.count()
// ((rdd1.union(((rdd0.union(EmptyRDD).join(otherRDD0)).filter(pred).map(f)).join(otherRDD1)).filter(pred).map(f).count()
// cache result sothat we don't need to compute it next time  


This iterative process goes on with each iteration.

As we can see, the graph representing the RDD computation keeps on growing. cache reduces the cost of making all calculations each time. checkpoint is needed every so often to write a concrete computed value of this growing graph so that we can use it as baseline instead of having to evaluate the whole chain.

An interesting way to see this process in action is by adding a line within the foreachRDD to inspect the current lineage:

...
history.unpersist(false) // unpersist the 'old' history RDD
history = formatted // assign the new history
println(history.toDebugString())
...

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复