“Task not serializable” with java.time in spark-shell (or Zeppelin) but not in spark-submit
Question


Oddly, I have several times seen code behave differently under spark-submit than under spark-shell (or Zeppelin), even though I initially didn't believe such a difference was possible.

With some code, spark-shell (or Zeppelin) throws this exception, while spark-submit works fine:

org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:844)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:843)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:843)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:608)

Here is an example of code (I will try to simplify it further) that can cause the problem:

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf

def formatter1 = DateTimeFormatter.ofPattern("MM_dd_yy")

// Zero-pad each component ("3_1_20" -> "03_01_20"), then parse to an ISO date string.
val date1 = udf((date: String) => {
  val d = date.split("_").map(x => if (x.length < 2) "0" + x else x).mkString("_")
  LocalDate.from(formatter1.parse(d)).toString
})

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// Unpivot the toMelt columns into (column, row) key/value pairs, keeping toPreserve.
def melt(toPreserve: Seq[String], toMelt: Seq[String],
         column: String, row: String, df: DataFrame): DataFrame = {
  val _vars_and_vals = array(
    (for (c <- toMelt) yield struct(lit(c).alias(column), col(c).alias(row))): _*)
  val _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
  val cols = toPreserve.map(col _) ++
    (for (x <- List(column, row)) yield col("_vars_and_vals")(x).alias(x))
  _tmp.select(cols: _*)
}
val cNullState = melt(preserves, melts, "Date", "Confirmed", confirmed)
  .withColumn("Date", date1(col("Date")))
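
(preserves, melts, and confirmed are defined elsewhere in the original notebook.) A smaller reproduction that leaves melt out entirely, using made-up sample data of my own, would be:

import org.apache.spark.sql.functions.col
import spark.implicits._ // spark is the SparkSession the shell or Zeppelin provides

// Two made-up rows; calling the UDF makes Spark run the closure cleaner,
// which is where "Task not serializable" is raised.
val sample = Seq("3_1_20", "12_15_20").toDF("Date")
sample.withColumn("Date", date1(col("Date"))).show()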

The behavior is also flaky: the exception sometimes appears and sometimes does not.

I understand the basics of "Task not serializable" (the closure must be serialized and shipped to every executor node), but in this specific example I cannot figure out the following (see the sketch after this list):

  1. What is wrong with this code?
  2. If something is wrong, why does spark-submit work fine?
  3. If nothing is wrong, why do spark-shell and Zeppelin throw the exception?
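
For reference, here is my gloss on the usual explanation, not something stated in the original post: spark-shell and Zeppelin compile each input line into a wrapper class, so a def such as formatter1 becomes an instance method, and the UDF closure has to capture the whole wrapper instance, along with whatever non-serializable state it happens to hold, just to call it. In a compiled spark-submit application the same def usually lives on a top-level object and is invoked statically, so nothing extra is captured. A sketch of a REPL-safe variant (date1Safe is my own name) that builds the formatter inside the lambda:

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf

// Everything the closure needs is local, so only the function literal
// itself is serialized, never the enclosing REPL wrapper object.
val date1Safe = udf((date: String) => {
  val formatter = DateTimeFormatter.ofPattern("MM_dd_yy")
  val d = date.split("_").map(x => if (x.length < 2) "0" + x else x).mkString("_")
  LocalDate.from(formatter.parse(d)).toString
})

Since def formatter1 already re-created the formatter on every call, moving the construction into the lambda costs nothing extra per row. It would also explain the flakiness: whether the captured wrapper is serializable depends on what else has been defined in the session.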

UPDATE: I found the trigger, though I still don't fully understand it. The exception is caused by

.withColumn("Date", date1(col("Date")))

where the date1 UDF uses java.time classes. Why java.time causes the problem, I don't know. I have updated the title to mention java.time.
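
A side note from my own checking, not from the original post: rewriting def formatter1 as a captured val would not have helped either, because java.time.format.DateTimeFormatter does not implement java.io.Serializable (unlike value types such as LocalDate). You can confirm this directly:

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import java.time.format.DateTimeFormatter

// Plain Java serialization of the formatter fails on its own,
// independent of Spark's closure cleaner.
val oos = new ObjectOutputStream(new ByteArrayOutputStream())
try oos.writeObject(DateTimeFormatter.ofPattern("MM_dd_yy"))
catch { case e: NotSerializableException => println(s"Not serializable: ${e.getMessage}") }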

Source: https://stackoverflow.com/questions/60922044/task-not-serializable-with-java-time-in-spark-shell-or-zeppelin-but-not-in-s
