I'm looking for a way to checkpoint DataFrames. checkpoint is currently an operation on RDD, but I can't find how to do it with DataFrames. persist and cache (which are synonyms…
TL;DR: For Spark versions up to 1.6, to actually get a "checkpointed DF", my suggested solution is based on another answer, but with one extra line:
df.rdd.checkpoint // Assumes sc.setCheckpointDir(...) has already been called
df.rdd.count // An action on that same RDD, to materialize the checkpoint
val df2 = sqlContext.createDataFrame(df.rdd, df.schema)
// df2 is checkpointed
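As a quick sanity check from the REPL, the rebuilt DataFrame's lineage should now start at the checkpoint (full transcripts are in the Explanation below):

```scala
// df2's debug string should contain a ReliableCheckpointRDD,
// i.e. the lineage is cut at the checkpoint.
df2.rdd.toDebugString
```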
Explanation
Updated after further research.
As pointed out, checkpointing a DataFrame directly is not currently possible (as of Spark 1.6.1), though there is an open issue for it in Spark's Jira.
So, a possible workaround is the one suggested in another answer:
df.rdd.checkpoint // Assuming the checkpoint dir has already been set
df.count // An action to compute the checkpoint
However, with this approach, only the df.rdd object will be checkpointed. This can be verified by calling toDebugString on df.rdd:
scala> df.rdd.toDebugString
(32) MapPartitionsRDD[1] at rdd at <console>:38 []
 |   ReliableCheckpointRDD[2] at count at <console>:38 []
Calling toDebugString after a quick transformation on df (note that I created my DataFrame from a JDBC source) then returns the following:
scala> df.withColumn("new_column", lit(0)).rdd.toDebugString
res4: String =
(32) MapPartitionsRDD[5] at rdd at <console>:38 []
 |   MapPartitionsRDD[4] at rdd at <console>:38 []
 |   JDBCRDD[3] at rdd at <console>:38 []
df.explain also shows a hint:
scala> df.explain
== Physical Plan ==
Scan JDBCRelation (...)
So, to actually achieve a "checkpointed" DataFrame, I can only think of creating a new one from the checkpointed RDD:
val newDF = sqlContext.createDataFrame(df.rdd, df.schema)
// or
val newDF = df.rdd.map {
case Row(val1: Int, ..., valN: Int) => (val1, ..., valN)
}.toDF("col1", ..., "colN")
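For the toDF variant to compile, the implicits from the SQLContext need to be in scope. A concrete sketch of the same pattern, with hypothetical column names and types of my own choosing:

```scala
import org.apache.spark.sql.Row
import sqlContext.implicits._ // brings .toDF into scope for RDDs of tuples

// Hypothetical three-column example of the pattern above
val newDF = df.rdd.map {
  case Row(col1: Int, col2: Int, col3: Int) => (col1, col2, col3)
}.toDF("col1", "col2", "col3")
```

Note that this variant forces you to enumerate and type every column, so the createDataFrame(df.rdd, df.schema) form is usually more convenient.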
Then we can verify that the new DataFrame is "checkpointed":
1) newDF.explain:
scala> newDF.explain
== Physical Plan ==
Scan PhysicalRDD[col1#5, col2#6, col3#7]
2) newDF.rdd.toDebugString:
scala> newDF.rdd.toDebugString
res7: String =
(32) MapPartitionsRDD[10] at rdd at <console>:40 []
 |   MapPartitionsRDD[8] at createDataFrame at <console>:37 []
 |   MapPartitionsRDD[1] at rdd at <console>:38 []
 |   ReliableCheckpointRDD[2] at count at <console>:38 []
3) With a transformation:
scala> newDF.withColumn("new_column", lit(0)).rdd.toDebugString
res9: String =
(32) MapPartitionsRDD[12] at rdd at <console>:40 []
 |   MapPartitionsRDD[11] at rdd at <console>:40 []
 |   MapPartitionsRDD[8] at createDataFrame at <console>:37 []
 |   MapPartitionsRDD[1] at rdd at <console>:38 []
 |   ReliableCheckpointRDD[2] at count at <console>:38 []
I also tried some more complex transformations and was able to verify, in practice, that the newDF object was checkpointed.
Therefore, the only way I found to reliably checkpoint a DataFrame was by checkpointing its associated RDD and creating a new DataFrame object from it.
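The whole pattern can be wrapped in a small helper. This is a sketch for Spark 1.6; checkpointed is my own name, not a Spark API, and it assumes the checkpoint directory has already been set:

```scala
import org.apache.spark.sql.DataFrame

// Checkpoints the DataFrame's underlying RDD and rebuilds a DataFrame
// whose lineage starts at the checkpoint. Assumes sc.setCheckpointDir(...)
// has already been called.
def checkpointed(df: DataFrame): DataFrame = {
  df.rdd.checkpoint()
  df.rdd.count() // action on the same RDD, to materialize the checkpoint
  df.sqlContext.createDataFrame(df.rdd, df.schema)
}
```

Usage would then simply be val df2 = checkpointed(df), after which df2.rdd.toDebugString should show the ReliableCheckpointRDD at the root of the lineage.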
I hope this helps. Cheers.