Question
If I have a dataset each record of which is a case class, and I persist that dataset as shown below so that serialization is used:
myDS.persist(StorageLevel.MEMORY_ONLY_SER)
Does Spark use Java/Kryo serialization to serialize the Dataset, or does Spark have its own way of storing the data in the Dataset, as it does for a DataFrame?
Answer 1:
A Spark Dataset does not use standard serializers. Instead it uses Encoders, which "understand" the internal structure of the data and can efficiently transform objects (anything that has an Encoder, including Row) into internal binary storage.
The only case where Kryo or Java serialization is used is when you explicitly apply Encoders.kryo[_] or Encoders.javaSerialization[_]. In any other case Spark destructures the object representation and tries to apply standard encoders (atomic encoders, the Product encoder, etc.). The only difference compared to Row is its Encoder, RowEncoder (in a sense, Encoders are similar to lenses).
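To make the distinction concrete, here is a minimal sketch (assuming a hypothetical Person case class and a local SparkSession, neither of which comes from the question) contrasting the default Product encoder with an explicit Encoders.kryo[_]:

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical case class used only for illustration.
case class Person(name: String, age: Int)

object EncoderDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("encoder-demo").getOrCreate()
    import spark.implicits._

    // Default path: Spark derives a Product encoder for the case class and
    // destructures it into typed columns in its internal binary format.
    val ds = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()
    ds.printSchema()  // name: string, age: int

    // Opt-in path: only here is Kryo used, and the whole object becomes a
    // single opaque binary column.
    val kryoDS = spark.createDataset(Seq(Person("Alice", 30)))(Encoders.kryo[Person])
    kryoDS.printSchema()  // value: binary

    spark.stop()
  }
}

Running both printSchema calls shows the difference directly: the encoder-backed Dataset keeps the field structure, while the Kryo-encoded one is stored as a single binary blob.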
Databricks explicitly contrasts Encoder / Dataset serialization with the Java and Kryo serializers in its post Introducing Apache Spark Datasets (see especially the Lightning-fast Serialization with Encoders section).
Source - Michael Armbrust, Wenchen Fan, Reynold Xin and Matei Zaharia. Introducing Apache Spark Datasets, https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
Answer 2:
Dataset[SomeCaseClass] is no different from Dataset[Row] or any other Dataset. It uses the same internal representation (mapped to instances of the external class when needed) and the same serialization method.
Therefore, there is no need for direct object serialization (Java, Kryo).
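As a sketch of that point (assuming a hypothetical Record case class and a local SparkSession), persisting the typed Dataset and its untyped DataFrame view goes through the same encoder-based storage:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical case class, standing in for the one in the question.
case class Record(id: Long, label: String)

object PersistDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("persist-demo").getOrCreate()
    import spark.implicits._

    val typedDS = Seq(Record(1L, "a"), Record(2L, "b")).toDS()  // Dataset[Record]
    val untypedDF = typedDS.toDF()                              // Dataset[Row]

    // Both are cached via the same internal representation; no Java/Kryo
    // serialization of Record instances is involved.
    typedDS.persist(StorageLevel.MEMORY_ONLY_SER)
    untypedDF.persist(StorageLevel.MEMORY_ONLY_SER)
    typedDS.count()      // materializes the cache
    untypedDF.count()

    spark.stop()
  }
}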
Answer 3:
Under the hood, a dataset is an RDD. From the documentation for RDD persistence:
Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
By default, Java serialization is used (source: the Spark tuning guide):
By default, Spark serializes objects using Java’s ObjectOutputStream framework... Spark can also use the Kryo library (version 2) to serialize objects more quickly.
To enable Kryo, initialize the job with a SparkConf and set spark.serializer to org.apache.spark.serializer.KryoSerializer:
val conf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
You may need to register classes with Kryo before creating the SparkContext:
conf.registerKryoClasses(Array(classOf[Class1], classOf[Class2]))
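Putting those pieces together, a minimal end-to-end sketch (with a hypothetical MyRecord class; only SparkConf.set, registerKryoClasses and RDD.persist come from the quoted documentation) that persists an RDD in serialized form with Kryo:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical class used only to illustrate Kryo registration.
case class MyRecord(id: Int, name: String)

object KryoPersistDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("kryo-persist-demo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))

    val sc = new SparkContext(conf)

    // With MEMORY_ONLY_SER each partition is stored as one byte array,
    // serialized with Kryo because of the configuration above.
    val rdd = sc.parallelize(Seq(MyRecord(1, "a"), MyRecord(2, "b")))
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    rdd.count()  // materializes the cache

    sc.stop()
  }
}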
Source: https://stackoverflow.com/questions/47983465/spark-dataset-serialization