I'm looking for a way to checkpoint DataFrames. Checkpoint is currently an operation on RDD, but I can't find how to do it with DataFrames. persist and cache (which are synonyms for each other) are available for DataFrames, but they do not break the lineage.
Extending Assaf Mendelson's answer:
As of Spark 2.2, the Dataset#checkpoint() API is still marked Evolving and Experimental.
Before calling checkpoint, the checkpoint directory has to be set on the SparkContext:
spark.sparkContext.setCheckpointDir("checkpoint/dir/location")
import org.apache.spark.sql.Dataset
import spark.implicits._  // needed for the 'id symbol-to-Column syntax below

val ds: Dataset[java.lang.Long] = spark.range(10).repartition('id % 2)
// do the checkpoint now; it preserves the partitioning as well
val cp: Dataset[java.lang.Long] = ds.checkpoint()
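You can sanity-check that the checkpoint really broke the lineage by comparing the query plans before and after, and confirm the partitioning survived; this is just a quick check using the standard explain() and rdd APIs (the plan for cp should collapse to a scan over the checkpointed data, i.e. a LogicalRDD, instead of the original Range plus Exchange):

ds.explain()  // full plan: Range followed by an Exchange for the repartition
cp.explain()  // should show only a scan of the checkpointed RDD

// the number of partitions is preserved across the checkpoint
assert(ds.rdd.getNumPartitions == cp.rdd.getNumPartitions)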
So far, the implementation of Dataset checkpointing is to convert the Dataset to its internal RDD representation and then checkpoint that RDD.
// In Dataset.scala

// The API used in the example above
def checkpoint(): Dataset[T] = checkpoint(eager = true)

// Base implementation
def checkpoint(eager: Boolean): Dataset[T] = {
  val internalRdd = queryExecution.toRdd.map(_.copy())
  internalRdd.checkpoint()

  if (eager) {
    internalRdd.count() // materialize the Dataset immediately on checkpoint() call
  }
  ...
}
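For completeness: on Spark versions before 2.1, where Dataset#checkpoint() does not exist, you can get the same lineage-breaking effect yourself by checkpointing the DataFrame's underlying RDD[Row] and rebuilding a DataFrame with the same schema on top of it. This is only a sketch along those lines; checkpointDataFrame is a hypothetical helper name, not a Spark API:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Checkpoint a DataFrame by checkpointing its backing RDD[Row]
// and re-creating a DataFrame with the same schema on top of it.
def checkpointDataFrame(df: DataFrame, spark: SparkSession): DataFrame = {
  val rdd = df.rdd                        // RDD[Row] backing the DataFrame
  rdd.checkpoint()                        // mark it for checkpointing
  rdd.count()                             // force materialization, like eager = true
  spark.createDataFrame(rdd, df.schema)   // lineage now starts at the checkpointed RDD
}

Note that the new DataFrame loses the optimized logical plan of the original, which is exactly the point of checkpointing in loops with hundreds of iterations.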