Question
The hardest thing in Spark, imho, is serialization. I looked at https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-a2287cd51c54 some time ago and I am fairly sure I understand the object-related aspects; I ran the code and it behaves as in the article's examples.
However, I am curious about a few other aspects when testing in a notebook on a Databricks Community Edition account (not a real cluster, but I also checked and confirmed the same behaviour on a Spark Standalone cluster via the spark-shell).
As the linked article describes, the following does not work for an RDD; I understand that this is because Example is not serializable:
object Example {
  val r = (1 to 1000000).toList
  val rdd = sc.parallelize(r, 3)
  val num = 1
  val rdd2 = rdd.map(_ + num) // fails with a serialization error, but a literal + 1 obviously works
  rdd2.collect
}
Example
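For what it's worth, the usual workaround I've seen (copying the field into a method-local val so the closure captures only an Int, not the whole non-serializable object) does run for me. A minimal sketch, assuming sc is in scope as in a notebook or spark-shell; ExampleFixed, run and localNum are just names I made up:

object ExampleFixed {
  val num = 1

  def run(): Array[Int] = {
    // Copy the field into a method-local val on the driver; the map closure
    // then captures only this Int, not the enclosing object.
    val localNum = num
    sc.parallelize((1 to 1000000).toList, 3).map(_ + localNum).collect()
  }
}
ExampleFixed.run()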
But why does the following work fine for a DataFrame? It emulates the first example's use of an external variable, just without an RDD, yet I get no serialization error. Is object Example serializable here, or is there some Databricks or DataFrame aspect I am not getting?
object Example {
  import spark.implicits._
  import org.apache.spark.sql.functions._

  val n = 1
  val df = sc.parallelize(Seq(
    ("r1", 1, 1),
    ("r2", 6, 4),
    ("r3", 4, 1),
    ("r4", 1, 2)
  )).toDF("ID", "a", "b")

  df.repartition(3).withColumn("plus1", $"b" + n).show(false)
}
Example
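One data point that makes me suspect the DataFrame case never ships a closure at all: rewriting the expression with an explicit lit behaves identically, which suggests n is folded into the Column expression on the driver rather than captured and serialized. This is my assumption, not something I have confirmed in the Spark source; Example2 is just a name for the sketch:

object Example2 {
  import spark.implicits._
  import org.apache.spark.sql.functions.lit

  val n = 1
  val df = sc.parallelize(Seq(
    ("r1", 1, 1),
    ("r2", 6, 4)
  )).toDF("ID", "a", "b")

  // $"b" + n appears to be sugar for $"b" + lit(n): n is read on the driver
  // while the logical plan is built, so only the literal value 1 travels to
  // the executors, never the enclosing Example2 object.
  df.withColumn("plus1", $"b" + lit(n)).show(false)
}
Example2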
Thanks in advance. I suspect I am missing a key point about the Databricks environment; I would at least expect consistent serialization behaviour on Databricks CE.
I am also wondering whether the article itself is the issue, since few people develop code structured like this.
Source: https://stackoverflow.com/questions/62511615/serialization-issues-df-vs-rdd