According to Introducing Spark Datasets:
As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom
Using generic encoders.
There are currently two generic encoders available, kryo and javaSerialization, where the latter is explicitly described as:
extremely inefficient and should only be used as the last resort.
Assuming the following class:
class Bar(i: Int) {
  override def toString = s"bar $i"
  def bar = i
}
you can use these encoders by adding an implicit encoder:
object BarEncoders {
  implicit def barEncoder: org.apache.spark.sql.Encoder[Bar] =
    org.apache.spark.sql.Encoders.kryo[Bar]
}
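which can be used together as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Main {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "test", new SparkConf())
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    import BarEncoders._
    // toDS resolves the implicit Encoder[Bar] brought in by BarEncoders
    val ds = Seq(new Bar(1)).toDS
    sc.stop()
  }
}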
It stores objects in a binary column, so when converted to a DataFrame you get the following schema:
root
 |-- value: binary (nullable = true)
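Because the whole object ends up in a single binary column, its fields cannot be addressed as columns; to get at them you have to go back through the typed API. A minimal sketch of that, assuming Spark 2.x or later with a local SparkSession (the names KryoSchemaSketch and run are only illustrative, and Bar is redeclared so the snippet stands alone):

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Same shape as the Bar class from the answer, repeated for self-containment.
class Bar(i: Int) {
  override def toString = s"bar $i"
  def bar = i
}

object KryoSchemaSketch {
  // Returns the type of the single column and the Int values recovered from it.
  def run(): (String, Seq[Int]) = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("kryo-schema-sketch").getOrCreate()
    try {
      implicit val barEncoder: Encoder[Bar] = Encoders.kryo[Bar]
      val ds = spark.createDataset(Seq(new Bar(1), new Bar(2)))

      // Prints the schema with one binary column named "value".
      ds.toDF().printSchema()
      val columnType = ds.schema.fields.head.dataType.simpleString

      // Typed map: each row is deserialized back into a Bar, then projected to Int.
      val values = ds.map((b: Bar) => b.bar)(Encoders.scalaInt).collect().toSeq
      (columnType, values)
    } finally spark.stop()
  }
}
```

The Int encoder is passed explicitly here only to keep the sketch independent of spark.implicits._; importing the implicits should work just as well.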
It is also possible to encode tuples using the kryo encoder for a specific field:
val longBarEncoder = Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Bar])
spark.createDataset(Seq((1L, new Bar(1))))(longBarEncoder)
// org.apache.spark.sql.Dataset[(Long, Bar)] = [_1: bigint, _2: binary]
Please note that we don't depend on implicit encoders here but pass the encoder explicitly, so this most likely won't work with the toDS method.
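The snippet above uses spark without defining it; I am assuming it is a SparkSession (Spark 2.0+). Under that assumption, a self-contained sketch of the explicit-encoder route could look like this (TupleEncoderSketch and run are made-up names, and Bar is redeclared so the snippet stands alone):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Same shape as the Bar class from the answer, repeated for self-containment.
class Bar(i: Int) {
  override def toString = s"bar $i"
  def bar = i
}

object TupleEncoderSketch {
  // Returns the column types and the rows decoded to (Long, Int) pairs.
  def run(): (Seq[String], Seq[(Long, Int)]) = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("tuple-encoder-sketch").getOrCreate()
    try {
      // Built-in encoder for the Long field, Kryo only for the Bar field.
      val longBarEncoder = Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Bar])

      // The encoder is passed explicitly; nothing is resolved implicitly.
      val ds = spark.createDataset(Seq((1L, new Bar(1)), (2L, new Bar(2))))(longBarEncoder)

      val columnTypes = ds.schema.fields.map(_.dataType.simpleString).toSeq
      // collect() deserializes each row back into a (Long, Bar) tuple.
      val pairs = ds.collect().toSeq.map { case (id, b) => (id, b.bar) }
      (columnTypes, pairs)
    } finally spark.stop()
  }
}
```

Only the Bar field becomes opaque binary; the Long field keeps a regular column type and stays usable in SQL expressions.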
Using implicit conversions:
Provide implicit conversions between a representation which can be encoded and the custom class, for example:
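object BarConversions {
  // Bar -> Int, so a Bar can be stored using the built-in Int encoder
  implicit def toInt(bar: Bar): Int = bar.bar
  // Int -> Bar, to recover the custom type from the encoded value
  implicit def toBar(i: Int): Bar = new Bar(i)
}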
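import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

object Main {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "test", new SparkConf())
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    import BarConversions._
    type EncodedBar = Int
    // Each Bar is implicitly converted to Int before being parallelized
    val bars: RDD[EncodedBar] = sc.parallelize(Seq(new Bar(1)))
    val barsDS = bars.toDS
    sc.stop()
  }
}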
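The conversions do all the work in this approach, so it can help to check them without Spark first. A minimal plain-Scala sketch of the round trip (ConversionSketch is an illustrative name; Bar and BarConversions are repeated so the snippet stands alone):

```scala
import scala.language.implicitConversions

class Bar(i: Int) {
  override def toString = s"bar $i"
  def bar = i
}

object BarConversions {
  implicit def toInt(bar: Bar): Int = bar.bar
  implicit def toBar(i: Int): Bar = new Bar(i)
}

object ConversionSketch {
  import BarConversions._

  type EncodedBar = Int

  // Bar -> Int: the element type is Int, so toInt is applied to each Bar.
  val encoded: Seq[EncodedBar] = Seq[EncodedBar](new Bar(1), new Bar(2))

  // Int -> Bar: Int has no member called bar, so toBar is applied first.
  val decoded: Seq[Int] = encoded.map(_.bar)

  // Explicit type ascription also triggers the Int -> Bar conversion.
  val labels: Seq[String] = encoded.map(i => (i: Bar).toString)
}
```

The same mechanism is what lets an expression like barsDS.map(_.bar) compile on a Dataset[Int]: the data itself stays encoded as Int, and Bar only appears at the call site.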
Related questions: