When I write an RDD transformation, e.g.
val rdd = sc.parallelize(1 to 1000)
rdd.map(x => x * 3)
I understand that the closure (x => x * 3) needs to be serialized so it can be shipped to the workers. Does that serialization happen at compile time or at runtime?
The closures are most certainly serialized at runtime. I have seen plenty of "Task not serializable" exceptions at runtime, from both PySpark and Scala, whenever a closure captures something that cannot be serialized.
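As an illustration, here is a minimal sketch of how such an exception can arise. It assumes a running SparkContext named sc (as in spark-shell); the Multiplier class is hypothetical, not something from the question:

// Hypothetical class: note that it does NOT extend Serializable.
class Multiplier(val factor: Int) {
  def timesFactor(x: Int): Int = x * factor
}

val multiplier = new Multiplier(3)
val rdd = sc.parallelize(1 to 1000)

// The closure captures `multiplier`. Because its class is not
// serializable, Spark's runtime serializability check fails with
// org.apache.spark.SparkException: Task not serializable.
rdd.map(x => multiplier.timesFactor(x))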
There is complex code, shown below from ClosureCleaner.scala, that attempts to minify the code being serialized:
def clean(
    closure: AnyRef,
    checkSerializable: Boolean = true,
    cleanTransitively: Boolean = true): Unit = {
  clean(closure, checkSerializable, cleanTransitively, Map.empty)
}
The minified code is then sent across the wire, provided it is serializable; otherwise an exception is thrown.
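For context on where that cleaning happens: transformations such as RDD.map pass the user's function through SparkContext.clean, which delegates to the ClosureCleaner, before the transformation is even recorded. The following is a rough paraphrase of the relevant lines in RDD.scala, not an exact excerpt:

// Paraphrased from RDD.scala: f is cleaned (and checked for
// serializability) when map is called, not when the job later runs.
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}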
Here is another excerpt from ClosureCleaner that checks whether an incoming function can be serialized:
private def ensureSerializable(func: AnyRef) {
  try {
    if (SparkEnv.get != null) {
      SparkEnv.get.closureSerializer.newInstance().serialize(func)
    }
  } catch {
    case ex: Exception => throw new SparkException("Task not serializable", ex)
  }
}
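Because this check serializes everything the closure captures, a common fix, continuing the hypothetical Multiplier sketch from above, is to copy the needed field into a local val so the non-serializable object is no longer captured:

// Capture only the Int in a local val; the closure now references
// `factor` rather than the non-serializable Multiplier instance,
// so the serializability check passes.
val factor = multiplier.factor
rdd.map(x => x * factor).collect()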