How to store custom objects in Dataset?

别那么骄傲 2020-11-22 01:53

According to Introducing Spark Datasets:

As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom encoders – while we currently autogenerate encoders for a wide variety of types, we'd like to open up an API for custom objects.

9 Answers
  •  无人及你
    2020-11-22 02:08

    1. Using generic encoders.

      There are two generic encoders available for now, kryo and javaSerialization, where the latter is explicitly described as:

      extremely inefficient and should only be used as the last resort.
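      For reference, the javaSerialization encoder follows the same pattern, with the additional requirement that the class implements java.io.Serializable (a minimal sketch; the Foo names are illustrative, not from the original answer):

      ```scala
      import org.apache.spark.sql.{Encoder, Encoders}

      // Java serialization requires the class to extend Serializable.
      class Foo(val i: Int) extends Serializable {
        override def toString = s"foo $i"
      }

      object FooEncoders {
        // Same usage pattern as kryo, just a different (and slower) wire format.
        implicit def fooEncoder: Encoder[Foo] = Encoders.javaSerialization[Foo]
      }
      ```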

      Assuming the following class

      class Bar(i: Int) {
        override def toString = s"bar $i"
        def bar = i
      }
      

      you can use these encoders by adding an implicit encoder:

      object BarEncoders {
        implicit def barEncoder: org.apache.spark.sql.Encoder[Bar] =
          org.apache.spark.sql.Encoders.kryo[Bar]
      }
      

      which can be used together as follows:

      object Main {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local", "test", new SparkConf())
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._
          import BarEncoders._

          val ds = Seq(new Bar(1)).toDS
          ds.show

          sc.stop()
        }
      }
      

      It stores objects as a binary column, so when converted to a DataFrame you get the following schema:

      root
       |-- value: binary (nullable = true)
      
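      Because that column is an opaque blob, SQL operations cannot look inside the stored objects; you can, however, map back to a type that has a built-in encoder. A minimal self-contained sketch (the object name is illustrative):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.{Encoders, SQLContext}

      object BinaryColumnDemo {
        class Bar(i: Int) {
          def bar = i
        }

        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local", "test", new SparkConf())
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._

          implicit val barEncoder = Encoders.kryo[Bar]

          val ds = Seq(new Bar(1)).toDS
          // map deserializes each Bar on the executors; the Int result
          // uses the built-in encoder from sqlContext.implicits._
          ds.map(_.bar).show

          sc.stop()
        }
      }
      ```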

      It is also possible to encode tuples using the kryo encoder for a specific field:

      val longBarEncoder = Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Bar])
      
      spark.createDataset(Seq((1L, new Bar(1))))(longBarEncoder)
      // org.apache.spark.sql.Dataset[(Long, Bar)] = [_1: bigint, _2: binary]
      

      Please note that we don't depend on implicit encoders here but pass the encoder explicitly, so this most likely won't work with the toDS method.
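      If you do need toDS, one common workaround is to declare the combined encoder as an implicit val in a local scope, where it takes precedence over the generic product encoder imported from implicits (a sketch; implicit-resolution precedence like this is version-sensitive, so verify against your Spark version):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.{Encoder, Encoders, SQLContext}

      object TupleEncoderDemo {
        class Bar(i: Int) {
          def bar = i
        }

        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local", "test", new SparkConf())
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._

          // A local implicit val outranks the imported newProductEncoder.
          implicit val longBarEncoder: Encoder[(Long, Bar)] =
            Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Bar])

          val ds = Seq((1L, new Bar(1))).toDS  // resolves longBarEncoder implicitly
          ds.show

          sc.stop()
        }
      }
      ```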

    2. Using implicit conversions:

      Provide implicit conversions between a representation which can be encoded and the custom class, for example:

      object BarConversions {
        implicit def toInt(bar: Bar): Int = bar.bar
        implicit def toBar(i: Int): Bar = new Bar(i)
      }
      
      object Main {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local", "test", new SparkConf())
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._
          import BarConversions._

          type EncodedBar = Int

          val bars: RDD[EncodedBar] = sc.parallelize(Seq(new Bar(1)))
          val barsDS = bars.toDS

          barsDS.show
          barsDS.map(_.bar).show

          sc.stop()
        }
      }
      
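      Unlike the kryo-based approach, the stored representation here is a plain int, so the resulting Dataset stays fully queryable. Continuing the example above (a sketch; the schema comment is what Dataset[Int] semantics should produce):

      ```scala
      // Converting barsDS (a Dataset[Int] under the hood) to a DataFrame
      // yields an ordinary integer column rather than an opaque binary one:
      val barsDF = barsDS.toDF
      barsDF.printSchema()
      // root
      //  |-- value: integer (nullable = false)
      ```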

    Related questions:

    • How to create encoder for Option type constructor, e.g. Option[Int]?
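    On that question: one commonly suggested approach is to derive the encoder via reflection with ExpressionEncoder (a sketch; ExpressionEncoder lives in an internal catalyst package, so treat this as version-sensitive):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Encoder, SQLContext}
    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

    object OptionEncoderDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "test", new SparkConf())
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Reflection-based encoder; None is stored as a SQL NULL.
        implicit val optIntEncoder: Encoder[Option[Int]] = ExpressionEncoder()

        val ds = Seq(Some(1), None: Option[Int]).toDS
        ds.show

        sc.stop()
      }
    }
    ```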
