I'm looking for a way to checkpoint DataFrames. Checkpoint is currently an operation on RDD, but I can't find how to do it with DataFrames. persist and cache (which are synonyms for each other) are available for DataFrames, but they do not break the lineage.
Extending Assaf Mendelson's answer:
As of Spark 2.2, the Dataset#checkpoint() API is still marked Evolving and Experimental.
Before calling checkpoint, the checkpoint directory has to be set on the SparkContext:
spark.sparkContext.setCheckpointDir("checkpoint/dir/location")
import org.apache.spark.sql.Dataset
import spark.implicits._  // needed for the 'id symbol-to-Column syntax below

val ds: Dataset[java.lang.Long] = spark.range(10).repartition('id % 2)
// do the checkpoint now; it preserves the partitioning as well
val cp: Dataset[java.lang.Long] = ds.checkpoint()
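You can sanity-check that the checkpoint really broke the lineage by comparing the query plans before and after, and confirm the partitioning survived; this is just a quick check using the standard explain() and rdd APIs (the plan for cp should collapse to a scan over the checkpointed data, i.e. a LogicalRDD, instead of the original Range plus Exchange):

ds.explain()  // full plan: Range followed by an Exchange for the repartition
cp.explain()  // should show only a scan of the checkpointed RDD

// the number of partitions is preserved across the checkpoint
assert(ds.rdd.getNumPartitions == cp.rdd.getNumPartitions)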
So far, the implementation of Dataset checkpointing is to convert the Dataset to its internal RDD representation and then checkpoint that RDD.
// In Dataset.scala

// The API used in the example above
def checkpoint(): Dataset[T] = checkpoint(eager = true)

// Base implementation
def checkpoint(eager: Boolean): Dataset[T] = {
  val internalRdd = queryExecution.toRdd.map(_.copy())
  internalRdd.checkpoint()

  if (eager) {
    internalRdd.count() // materialize the Dataset immediately on checkpoint() call
  }
  ...
}
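For completeness: on Spark versions before 2.1, where Dataset#checkpoint() does not exist, you can get the same lineage-breaking effect yourself by checkpointing the DataFrame's underlying RDD[Row] and rebuilding a DataFrame with the same schema on top of it. This is only a sketch along those lines; checkpointDataFrame is a hypothetical helper name, not a Spark API:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Checkpoint a DataFrame by checkpointing its backing RDD[Row]
// and re-creating a DataFrame with the same schema on top of it.
def checkpointDataFrame(df: DataFrame, spark: SparkSession): DataFrame = {
  val rdd = df.rdd                        // RDD[Row] backing the DataFrame
  rdd.checkpoint()                        // mark it for checkpointing
  rdd.count()                             // force materialization, like eager = true
  spark.createDataFrame(rdd, df.schema)   // lineage now starts at the checkpointed RDD
}

Note that the new DataFrame loses the optimized logical plan of the original, which is exactly the point of checkpointing in loops with hundreds of iterations.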