According to Introducing Spark Datasets:
As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom
Encoders work more or less the same in Spark2.0
. And Kryo
is still the recommended serialization
choice.
You can look at following example with spark-shell
scala> import spark.implicits._
import spark.implicits._
scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders
scala> case class NormalPerson(name: String, age: Int) {
| def aboutMe = s"I am ${name}. I am ${age} years old."
| }
defined class NormalPerson
scala> case class ReversePerson(name: Int, age: String) {
| def aboutMe = s"I am ${name}. I am ${age} years old."
| }
defined class ReversePerson
scala> val normalPersons = Seq(
| NormalPerson("Superman", 25),
| NormalPerson("Spiderman", 17),
| NormalPerson("Ironman", 29)
| )
normalPersons: Seq[NormalPerson] = List(NormalPerson(Superman,25), NormalPerson(Spiderman,17), NormalPerson(Ironman,29))
scala> val ds1 = sc.parallelize(normalPersons).toDS
ds1: org.apache.spark.sql.Dataset[NormalPerson] = [name: string, age: int]
scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
scala> ds1.show()
+---------+---+
| name|age|
+---------+---+
| Superman| 25|
|Spiderman| 17|
| Ironman| 29|
+---------+---+
scala> ds2.show()
+----+---------+
|name| age|
+----+---------+
| 25| Superman|
| 17|Spiderman|
| 29| Ironman|
+----+---------+
scala> ds1.foreach(p => println(p.aboutMe))
I am Ironman. I am 29 years old.
I am Superman. I am 25 years old.
I am Spiderman. I am 17 years old.
scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
scala> ds2.foreach(p => println(p.aboutMe))
I am 17. I am Spiderman years old.
I am 25. I am Superman years old.
I am 29. I am Ironman years old.
Till now] there were no appropriate encoders
in present scope so our persons were not encoded as binary
values. But that will change once we provide some implicit
encoders using Kryo
serialization.
// Provide Encoders
scala> implicit val normalPersonKryoEncoder = Encoders.kryo[NormalPerson]
normalPersonKryoEncoder: org.apache.spark.sql.Encoder[NormalPerson] = class[value[0]: binary]
scala> implicit val reversePersonKryoEncoder = Encoders.kryo[ReversePerson]
reversePersonKryoEncoder: org.apache.spark.sql.Encoder[ReversePerson] = class[value[0]: binary]
// Ecoders will be used since they are now present in Scope
scala> val ds3 = sc.parallelize(normalPersons).toDS
ds3: org.apache.spark.sql.Dataset[NormalPerson] = [value: binary]
scala> val ds4 = ds3.map(np => ReversePerson(np.age, np.name))
ds4: org.apache.spark.sql.Dataset[ReversePerson] = [value: binary]
// now all our persons show up as binary values
scala> ds3.show()
+--------------------+
| value|
+--------------------+
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
+--------------------+
scala> ds4.show()
+--------------------+
| value|
+--------------------+
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
+--------------------+
// Our instances still work as expected
scala> ds3.foreach(p => println(p.aboutMe))
I am Ironman. I am 29 years old.
I am Spiderman. I am 17 years old.
I am Superman. I am 25 years old.
scala> ds4.foreach(p => println(p.aboutMe))
I am 25. I am Superman years old.
I am 29. I am Ironman years old.
I am 17. I am Spiderman years old.