Getting strange behavior when calling function outside of a closure:
def upper(name: String) : String = {
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
val emp_details = """[{"id": "1","name": "James Butt","country": "USA"},
{"id": "2", "name": "Josephine Darakjy","country": "USA"},
{"id": "3", "name": "Art Venere","country": "USA"},
{"id": "4", "name": "Lenna Paprocki","country": "USA"},
{"id": "5", "name": "Donette Foller","country": "USA"},
{"id": "6", "name": "Leota Dilliard","country": "USA"}]"""
val df_emp = spark.read.json(Seq(emp_details).toDS())
val df_name=df_emp.select($"id",$"name")
val df_upperName= df_name.withColumn("name",toUpperName($"name")).filter("id='5'")
display(df_upperName)
this will give error org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
Solution -
import java.io.Serializable;
object obj_upper extends Serializable {
def upper(name: String) : String =
{
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
}
val df_upperName=
df_name.withColumn("name",obj_upper.toUpperName($"name")).filter("id='5'")
display(df_upperName)
I had a similar experience.
The error was triggered when I initialize a variable on the driver (master), but then tried to use it on one of the workers. When that happens, Spark Streaming will try to serialize the object to send it over to the worker, and fail if the object is not serializable.
I solved the error by making the variable static.
Previous non-working code
private final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Working code
private static final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Credits:
I'm not entirely certain that this applies to Scala but, in Java, I solved the NotSerializableException
by refactoring my code so that the closure did not access a non-serializable final
field.