{
var history: RDD[(String, List[String]) = sc.emptyRDD()
val dstream1 = ...
val dstream2 = ...
val historyDStream = dstream1.transform(rdd => rdd.union(history))
v
We can understand how the history builds up in this case by observing how the RDD lineage evolves over time.
We need two pieces of previous knowledge:
Let's see a quick example, using the classical wordCount:
val txt = sparkContext.textFile(someFile)
val words = txt.flatMap(_.split(" "))
In simplified terms, txt
is a HadoopRDD(someFile)
. words
is a MapPartitionsRDD(txt, flatMapFunction)
. We speak of the lineage
of words as the DAG (Direct Acyclic Graph) that is formed of this chaining of operations.: HadoopRDD <-- MapPartitionsRDD
.
We can apply the same principles to our streaming operation:
At iteration 0, we have
var history: RDD[(String, List[String]) = sc.emptyRDD()
// -> history: EmptyRDD
...
val historyDStream = dstream1.transform(rdd => rdd.union(history))
// -> underlying RDD: rdd.union(EmptyRDD)
join, filter
// underlying RDD: ((rdd.union(EmptyRDD).join(otherRDD)).filter(pred)
map
// -> underlying RDD: ((rdd.union(EmptyRDD).join(otherRDD)).filter(pred).map(f)
history.unpersist(false)
// EmptyRDD.unpersist (does nothing, it was never persisted)
history = formatted
// history = ((rdd.union(EmptyRDD).join(otherRDD)).filter(pred).map(f)
history.persist(...)
// history marked for persistence (at the next action)
history.count()
// ((rdd.union(EmptyRDD).join(otherRDD)).filter(pred).map(f).count()
// cache result of: ((rdd.union(EmptyRDD).join(otherRDD)).filter(pred).map(f)
At iteration 1, we have (adding rdd0, rdd1 as iteration index):
val historyDStream = dstream1.transform(rdd => rdd.union(history))
// -> underlying RDD: rdd1.union(((rdd0.union(EmptyRDD).join(otherRDD0)).filter(pred).map(f))
join, filter
// underlying RDD: ((rdd1.union(((rdd0.union(EmptyRDD).join(otherRDD0)).filter(pred).map(f)).join(otherRDD1)).filter(pred)
map
// -> underlying RDD: ((rdd1.union(((rdd0.union(EmptyRDD).join(otherRDD0)).filter(pred).map(f)).join(otherRDD1)).filter(pred).map(f)
history.unpersist(false)
// history0.unpersist (marks the previous result for removal, we used it already for our computation above)
history = formatted
// history1 = ((rdd1.union(((rdd0.union(EmptyRDD).join(otherRDD0)).filter(pred).map(f)).join(otherRDD1)).filter(pred).map(f)
history.persist(...)
// new history marked for persistence (at the next action)
history.count()
// ((rdd1.union(((rdd0.union(EmptyRDD).join(otherRDD0)).filter(pred).map(f)).join(otherRDD1)).filter(pred).map(f).count()
// cache result sothat we don't need to compute it next time
This iterative process goes on with each iteration.
As we can see, the graph representing the RDD computation keeps on growing. cache
reduces the cost of making all calculations each time. checkpoint
is needed every so often to write a concrete computed value of this growing graph so that we can use it as baseline instead of having to evaluate the whole chain.
An interesting way to see this process in action is by adding a line within the foreachRDD to inspect the current lineage:
...
history.unpersist(false) // unpersist the 'old' history RDD
history = formatted // assign the new history
println(history.toDebugString())
...