Question
The hardest thing in Spark, imho, is serialization. I looked at https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-a2287cd51c54 some time ago and I am fairly sure I understand the object-related aspects; I ran the code and it behaves as in the article's examples.
However, I am curious about a few other aspects when testing in a notebook on a Databricks Community Edition account (not a real cluster, but I also checked and confirmed the same behaviour on a Spark Standalone cluster via the spark-shell).
As the linked article describes, the following does not work for an RDD; I understand that this is because Example is not serializable:
object Example {
  val r = (1 to 1000000).toList
  val rdd = sc.parallelize(r, 3)
  val num = 1
  val rdd2 = rdd.map(_ + num) // fails with a serialization error, but a literal + 1 obviously works
  rdd2.collect
}
Example
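For what it's worth, the usual workaround I've seen (copying the field into a method-local val so the closure captures only an Int, not the whole non-serializable object) does run for me. A minimal sketch, assuming sc is in scope as in a notebook or spark-shell; ExampleFixed, run and localNum are just names I made up:

object ExampleFixed {
  val num = 1

  def run(): Array[Int] = {
    // Copy the field into a method-local val on the driver; the map closure
    // then captures only this Int, not the enclosing object.
    val localNum = num
    sc.parallelize((1 to 1000000).toList, 3).map(_ + localNum).collect()
  }
}
ExampleFixed.run()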
But why does the following work fine for a DataFrame? It emulates the first example's use of an external variable, just without an RDD, yet I get no serialization error. Is object Example serializable here, or is there some Databricks or DataFrame aspect I am not getting?
object Example {
  import spark.implicits._
  import org.apache.spark.sql.functions._

  val n = 1
  val df = sc.parallelize(Seq(
    ("r1", 1, 1),
    ("r2", 6, 4),
    ("r3", 4, 1),
    ("r4", 1, 2)
  )).toDF("ID", "a", "b")

  df.repartition(3).withColumn("plus1", $"b" + n).show(false)
}
Example
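One data point that makes me suspect the DataFrame case never ships a closure at all: rewriting the expression with an explicit lit behaves identically, which suggests n is folded into the Column expression on the driver rather than captured and serialized. This is my assumption, not something I have confirmed in the Spark source; Example2 is just a name for the sketch:

object Example2 {
  import spark.implicits._
  import org.apache.spark.sql.functions.lit

  val n = 1
  val df = sc.parallelize(Seq(
    ("r1", 1, 1),
    ("r2", 6, 4)
  )).toDF("ID", "a", "b")

  // $"b" + n appears to be sugar for $"b" + lit(n): n is read on the driver
  // while the logical plan is built, so only the literal value 1 travels to
  // the executors, never the enclosing Example2 object.
  df.withColumn("plus1", $"b" + lit(n)).show(false)
}
Example2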
Thanks in advance. I suspect I am missing a key point about the Databricks environment; I would at least expect consistent serialization behaviour on Databricks CE.
I am also wondering whether the article itself is the issue, since few people develop code structured like this.
Source: https://stackoverflow.com/questions/62511615/serialization-issues-df-vs-rdd