How to checkpoint DataFrames?

一整个雨季 2021-02-02 09:16

I'm looking for a way to checkpoint DataFrames. Checkpoint is currently an operation on RDD, but I can't find how to do it with DataFrames. persist and cache (which are synonyms for each other) …

5 Answers
  •  一整个雨季
    2021-02-02 09:50

    Extending Assaf Mendelson's answer:

    As of Spark 2.2, the Dataset#checkpoint() API is marked Evolving and Experimental.

    Usage:

    Before calling checkpoint, a checkpoint directory has to be set on the SparkContext:

    import spark.implicits._   // needed for the 'id column syntax below

    spark.sparkContext.setCheckpointDir("checkpoint/dir/location")

    val ds = spark.range(10).repartition('id % 2)

    // do the checkpoint now; the partitioning is preserved
    val cp = ds.checkpoint()
    

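    For reference, here is a minimal end-to-end sketch (assuming Spark 2.2+ with a local SparkSession; the checkpoint directory path and app name are illustrative) showing the effect: after checkpoint() the plan no longer carries the original lineage.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("checkpoint-demo")
      .getOrCreate()
    import spark.implicits._

    // the checkpoint directory must be set before checkpoint() is called
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    val ds = spark.range(10).repartition('id % 2)
    val cp = ds.checkpoint()   // eager by default: checkpoint files are written now

    ds.explain()   // full plan: Range followed by an Exchange (the repartition)
    cp.explain()   // plan scans the checkpointed internal RDD; the lineage is gone
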
    How does it work internally?

    So far, Dataset checkpointing is implemented by converting the Dataset to its internal RDD and then checkpointing that RDD.

    // In Dataset.scala

    // API used in the example above
    def checkpoint(): Dataset[T] = checkpoint(eager = true)

    // Base implementation
    def checkpoint(eager: Boolean): Dataset[T] = {
      val internalRdd = queryExecution.toRdd.map(_.copy())
      internalRdd.checkpoint()

      if (eager) {
        internalRdd.count()  // materializes the Dataset immediately on checkpoint()
      }

      ...
    }
    

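    Since checkpoint() ultimately just checkpoints the underlying RDD, roughly the same effect can be achieved by hand on older Spark versions that lack Dataset#checkpoint(). A minimal sketch, assuming the SparkSession and checkpoint directory from above; the DataFrame and variable names are illustrative:

    // checkpoint the DataFrame's RDD of Rows and rebuild a DataFrame from it
    val df = spark.range(10).toDF("id")
    val rowRdd = df.rdd            // RDD[Row]
    rowRdd.checkpoint()
    rowRdd.count()                 // materialize so the checkpoint is actually written
    val checkpointedDf = spark.createDataFrame(rowRdd, df.schema)
    checkpointedDf.explain()       // plan now starts from the checkpointed RDD
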