Question
Strangely, I have found several times that running with spark-submit behaves differently from running with spark-shell (or Zeppelin), hard as that is to believe.
With some code, spark-shell (or Zeppelin) throws this exception while spark-submit works just fine:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:844)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:843)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:843)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:608)
Here is an example of code (simplified as much as I could) that can cause the problem:
import java.time.format.DateTimeFormatter
import java.time.LocalDate
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct, udf}

def formatter1 = DateTimeFormatter.ofPattern("MM_dd_yy")

// Zero-pads single-digit parts ("3_5_20" -> "03_05_20") and parses to an ISO date string
val date1 = udf((date: String) => {
  val d = date.split("_").map(x => if (x.length < 2) "0" + x else x).mkString("_")
  LocalDate.from(formatter1.parse(d)).toString
})

// Unpivots the toMelt columns into (column, row) pairs, keeping the toPreserve columns
def melt(toPreserve: Seq[String], toMelt: Seq[String], column: String, row: String, df: DataFrame): DataFrame = {
  val _vars_and_vals = array((for (c <- toMelt) yield struct(lit(c).alias(column), col(c).alias(row))): _*)
  val _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
  val cols = toPreserve.map(col _) ++ (for (x <- List(column, row)) yield col("_vars_and_vals")(x).alias(x))
  _tmp.select(cols: _*)
}
val cNullState = melt(preserves, melts, "Date", "Confirmed", confirmed).withColumn("Date", date1(col("Date")))
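preserves, melts, and confirmed come from my actual data; purely as a hypothetical illustration, toy stand-ins like these show the wide-to-long reshape that melt performs:

// hypothetical stand-ins for preserves / melts / confirmed
// (toDF needs import spark.implicits._, which spark-shell already has in scope)
val confirmed = Seq(("US", 1, 3), ("FR", 2, 4)).toDF("Country", "3_1_20", "3_2_20")
val preserves = Seq("Country")
val melts = Seq("3_1_20", "3_2_20")
melt(preserves, melts, "Date", "Confirmed", confirmed).show()
// +-------+------+---------+
// |Country|  Date|Confirmed|
// +-------+------+---------+
// |     US|3_1_20|        1|
// |     US|3_2_20|        3|
// |     FR|3_1_20|        2|
// |     FR|3_2_20|        4|
// +-------+------+---------+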
Also, this behavior is unstable: sometimes it happens, sometimes it does not.
I understand the basics of "Task not serializable" (the closure gets serialized and shipped to each executor node, etc.; see the sketch after this list), but in this specific example I could not figure out:
- What is wrong with this code?
- If something is wrong, why does spark-submit work fine?
- If nothing is wrong, why do spark-shell and Zeppelin throw the exception?
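For reference, this is the classic shape of the failure as I understand it (a minimal hypothetical sketch, not my actual code):

// Helper does not extend Serializable, so a closure holding one cannot be serialized
class Helper { def pad(s: String): String = if (s.length < 2) "0" + s else s }
val h = new Helper
// map() calls SparkContext.clean, which throws "Task not serializable" because the closure captures h
sc.parallelize(Seq("1", "12")).map(h.pad)

But my code above does not obviously capture anything like that.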
UPDATE: I found the cause, though I have not understood it completely. It is triggered by
.withColumn("Date", date1(col("Date")))
where the date1 udf uses something from java.time. But why java.time is the problem, I do not know. I have updated the title to mention "java time".
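In case it helps, here is a sketch (assuming the rest of the pipeline is unchanged) of a variant that should avoid capturing anything from the enclosing scope, by constructing the formatter inside the lambda:

val date1 = udf { (date: String) =>
  // built inside the closure, so nothing from the enclosing (REPL wrapper) scope is captured
  val fmt = DateTimeFormatter.ofPattern("MM_dd_yy")
  val d = date.split("_").map(x => if (x.length < 2) "0" + x else x).mkString("_")
  LocalDate.from(fmt.parse(d)).toString
}

My current guess: in spark-shell, top-level definitions such as formatter1 become members of a REPL wrapper object, so a closure that calls formatter1 captures that wrapper and whatever non-serializable state it holds, while code compiled for spark-submit has no such wrapping, which would explain the difference; what else the wrapper holds varies with session history, which might also explain the instability.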
Source: https://stackoverflow.com/questions/60922044/task-not-serializable-with-java-time-in-spark-shell-or-zeppelin-but-not-in-s