How history RDDs are preserved for further use in the given code

var history: RDD[(String, List[String])] = sc.emptyRDD()

val dstream1 = ...
val dstream2 = ...

val historyDStream = dstream1.transform(rdd => rdd.union(history))
...


        
1 Answer

    We can understand how the history builds up in this case by observing how the RDD lineage evolves over time.

    We need two pieces of prior knowledge:

    • RDDs are immutable structures
    • Operations on RDDs can be expressed in functional terms by the function to be applied and references to the input RDDs.

    Let's see a quick example, using the classic wordCount:

    val txt = sparkContext.textFile(someFile)
    val words = txt.flatMap(_.split(" "))
    

    In simplified terms, txt is a HadoopRDD(someFile) and words is a MapPartitionsRDD(txt, flatMapFunction). We speak of the lineage of words as the DAG (Directed Acyclic Graph) formed by this chaining of operations: HadoopRDD <-- MapPartitionsRDD.
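
    If you want to see this lineage concretely, Spark exposes it through RDD.toDebugString. A minimal sketch, runnable in spark-shell (where sc is already provided); the file path is just a placeholder:

    val someFile = "/tmp/some-file.txt"    // placeholder path, any text file works
    val txt = sc.textFile(someFile)        // backed by a HadoopRDD under the hood
    val words = txt.flatMap(_.split(" "))  // MapPartitionsRDD whose parent is txt
    println(words.toDebugString)           // prints the chain of parent RDDs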

    We can apply the same principles to our streaming operation:

    At iteration 0, we have

    var history: RDD[(String, List[String])] = sc.emptyRDD()
    // -> history: EmptyRDD
    ...
    val historyDStream = dstream1.transform(rdd => rdd.union(history))
    // -> underlying RDD: rdd.union(EmptyRDD)
    join, filter
    // -> underlying RDD: rdd.union(EmptyRDD).join(otherRDD).filter(pred)
    map
    // -> underlying RDD: rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f)
    history.unpersist(false)
    // EmptyRDD.unpersist (does nothing, it was never persisted)
    history = formatted
    // history = rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f)
    history.persist(...)
    // history marked for persistence (at the next action)
    history.count()
    // rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f).count()
    // caches the result of: rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f)
    

    At iteration 1, we have (adding the iteration index as a suffix: rdd0, rdd1, otherRDD0, otherRDD1):

    val historyDStream = dstream1.transform(rdd => rdd.union(history))
    // -> underlying RDD: rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f))
    join, filter
    // -> underlying RDD: rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred)
    map
    // -> underlying RDD: rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred).map(f)
    history.unpersist(false)
    // history0.unpersist (marks the previous result for removal; we already used it for the computation above)
    history = formatted
    // history1 = rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred).map(f)
    history.persist(...)
    // new history marked for persistence (at the next action)
    history.count()
    // rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred).map(f).count()
    // caches the result so that we don't need to recompute it next time
    

    This process repeats with each batch interval.
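
    For reference, here is a minimal sketch of the kind of foreachRDD body these per-iteration steps correspond to. It is assembled from the snippets above, not the asker's exact code: dstream2 is assumed to carry (String, String) pairs and keep is a stand-in predicate.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // history starts empty and is reassigned at every batch interval
    var history: RDD[(String, List[String])] = sc.emptyRDD()

    val historyDStream = dstream1.transform(rdd => rdd.union(history))
    val joined = historyDStream.join(dstream2)            // (key, (List[String], String)) pairs

    joined.foreachRDD { rdd =>
      val formatted = rdd
        .filter { case (_, (hist, v)) => keep(hist, v) }  // `keep` is a stand-in predicate
        .map { case (k, (hist, v)) => (k, v :: hist) }    // fold the new value into the history
      history.unpersist(blocking = false)                 // drop the previously cached history
      history = formatted                                 // point `history` at the new lineage
      history.persist(StorageLevel.MEMORY_AND_DISK)       // mark it for caching
      history.count()                                     // action: materializes and caches it
    }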

    As we can see, the graph representing the RDD computation keeps growing. cache reduces the cost of redoing all of those calculations each time. checkpoint is needed every so often to write a concrete, computed value of this growing graph to stable storage, so that it can be used as a baseline instead of having to evaluate the whole chain.
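
    A hedged sketch of what such a periodic checkpoint could look like, reusing the names from the sketch above; the 10-batch interval and the batchIndex counter are arbitrary choices for illustration:

    sc.setCheckpointDir("/tmp/spark-checkpoints")      // any reliable location; HDFS in production

    var batchIndex = 0L
    joined.foreachRDD { rdd =>
      val formatted = rdd
        .filter { case (_, (hist, v)) => keep(hist, v) }
        .map { case (k, (hist, v)) => (k, v :: hist) }
      history.unpersist(blocking = false)
      history = formatted
      history.persist(StorageLevel.MEMORY_AND_DISK)
      if (batchIndex % 10 == 0) history.checkpoint()   // cut the lineage every 10th batch
      history.count()                                  // one action materializes cache and checkpoint
      batchIndex += 1
    }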

    An interesting way to see this process in action is by adding a line within the foreachRDD to inspect the current lineage:

    ...
    history.unpersist(false) // unpersist the 'old' history RDD
    history = formatted // assign the new history
    println(history.toDebugString) // toDebugString is a parameterless method
    ...
    