Spark Scala Checkpointing Data Set showing .isCheckpointed = false after Action but directories written

元气小坏坏 submitted on 2021-02-11 14:21:49

Question


There are a few posts on this topic, but none seem to answer the question as I understand it.

The following code was run on Databricks:

spark.sparkContext.setCheckpointDir("/dbfs/FileStore/checkpoint/cp1/loc7")
val checkpointDir = spark.sparkContext.getCheckpointDir.get
val ds = spark.range(10).repartition(2)
ds.cache()
ds.checkpoint()
ds.count()
ds.rdd.isCheckpointed  

Added an improvement of sorts:

...
val ds2 = ds.checkpoint(eager=true)
println(ds2.queryExecution.toRdd.toDebugString)
...

returns:

(2) MapPartitionsRDD[307] at toRdd at command-1835760423149753:13 []
 |  MapPartitionsRDD[305] at checkpoint at command-1835760423149753:12 []
 |  ReliableCheckpointRDD[306] at checkpoint at command-1835760423149753:12 []
 checkpointDir: String = dbfs:/dbfs/FileStore/checkpoint/cp1/loc10/86cc77b5-27c3-4049-9136-503ddcab0fa9
 ds: org.apache.spark.sql.Dataset[Long] = [id: bigint]
 ds2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
 res53: Boolean = false

Question 1:

ds.rdd.isCheckpointed and ds2.rdd.isCheckpointed both return false, even though the count makes this a non-lazy situation. Why, when in particular the .../loc7 and .../loc10 directories are written with (part) files? We can also see the ReliableCheckpointRDD in the debug string!

The whole concept is not well explained anywhere; I am trying to sort it out.

Question 2 - secondary question:

Is the cache really necessary with the latest versions of Spark (2.4)? If the Dataset is not cached, will a new branch off ds cause recomputation, or is that handled better now? It seems odd that the checkpoint data would not be used, or could we say Spark does not really know which is better?

From High Performance Spark I get the mixed impression that checkpointing is not really recommended, but then again it is.


Answer 1:


TL;DR: You don't inspect the object that is actually checkpointed:

ds2.queryExecution.toRdd.dependencies(0).rdd.isCheckpointed
// Boolean = true

ds.rdd.isCheckpointed or ds2.rdd.isCheckpointed both return False

That is expected behavior. The object being checkpointed is not the converted RDD that you reference (which is the result of the additional transformations required to convert to the external representation), but the internal RDD object (in fact, as you can see above, it is not even the latest internal RDD, but its parent).

Additionally, in the first case you use the wrong Dataset object altogether: as explained in the linked answer, Dataset.checkpoint returns a new Dataset.
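To make this concrete, here is a minimal sketch (local mode, with a hypothetical /tmp checkpoint directory standing in for the DBFS path) showing that the checkpoint lives on the internal RDD lineage of the returned Dataset, while the public .rdd wrappers of both objects report false:

```scala
import org.apache.spark.sql.SparkSession

object CheckpointDemo extends App {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("checkpoint-demo")
    .getOrCreate()

  // Hypothetical local checkpoint directory
  spark.sparkContext.setCheckpointDir("/tmp/checkpoint-demo")

  val ds  = spark.range(10).repartition(2)
  val ds2 = ds.checkpoint() // eager by default; returns a NEW Dataset

  // Both public wrappers are fresh RDD conversions, hence false:
  println(ds.rdd.isCheckpointed)  // false
  println(ds2.rdd.isCheckpointed) // false

  // The checkpoint lives on the parent of ds2's internal RDD:
  println(ds2.queryExecution.toRdd.dependencies(0).rdd.isCheckpointed) // true

  spark.stop()
}
```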

even though with count I have a non-lazy situation

That doesn't make much sense. The default checkpoint implementation is eager, so it forces evaluation. Even if it weren't, Dataset.count is not the right way to force evaluation.
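A short sketch of the eager flag (again assuming a local session and a hypothetical checkpoint directory): the default call materializes immediately, while eager = false defers the work until an action runs on the returned Dataset:

```scala
import org.apache.spark.sql.SparkSession

object EagerDemo extends App {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("eager-demo")
    .getOrCreate()
  spark.sparkContext.setCheckpointDir("/tmp/eager-demo") // hypothetical path

  val ds = spark.range(10).repartition(2)

  // eager = true (the default): a Spark job runs right here and the
  // checkpoint files are written before the call returns.
  val eagerCp = ds.checkpoint()

  // eager = false: nothing is written yet; the checkpoint happens the
  // first time an action runs on the RETURNED Dataset, not on ds.
  val lazyCp = ds.checkpoint(eager = false)
  lazyCp.count() // this action triggers the actual checkpointing

  spark.stop()
}
```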

Is the cache really necessary or not with latest version

As you can see in the linked source, Dataset.checkpoint uses RDD.checkpoint internally, so the same rules apply. However, you already execute a separate action to force the checkpoint, so additional caching, especially considering the cost of Dataset persistence, could be overkill.

Of course, if in doubt, you might consider benchmarking in a specific context.
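As a rough illustration of why the extra cache may be redundant (a sketch under the same local-mode assumptions as above): once the eager checkpoint has truncated the lineage, later branches start from the checkpoint files rather than re-running the original plan:

```scala
import org.apache.spark.sql.SparkSession

object BranchDemo extends App {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("branch-demo")
    .getOrCreate()
  spark.sparkContext.setCheckpointDir("/tmp/branch-demo") // hypothetical path

  val ds = spark.range(100).repartition(4)
  val cp = ds.checkpoint() // eager: the lineage of cp ends at the checkpoint

  // Two separate actions branch off cp; each starts from the checkpoint
  // data instead of re-running range + repartition.
  val evens = cp.filter("id % 2 = 0").count()
  val odds  = cp.filter("id % 2 != 0").count()
  println(s"$evens even, $odds odd") // 50 even, 50 odd

  spark.stop()
}
```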



Source: https://stackoverflow.com/questions/54005223/spark-scala-checkpointing-data-set-showing-ischeckpointed-false-after-action
