How to store custom objects in Dataset?

Asked by 别那么骄傲 on 2020-11-22 01:53

According to Introducing Spark Datasets:

As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom encoders – while we currently autogenerate encoders for a wide variety of types, we'd like to open up an API for custom objects.

9 Answers
  •  感情败类
    2020-11-22 02:17

    @Alec's answer is great! Just to add a comment on this part of it:

    import spark.implicits._
    case class Wrap[T](unwrap: T)   // generic wrapper; a case class (a Product)
    class MyObj(val i: Int)         // custom class with no built-in encoder
    // ...
    val d = spark.createDataset(Seq(Wrap(new MyObj(1)), Wrap(new MyObj(2)), Wrap(new MyObj(3))))
    

    @Alec mentions:

    there is no way of passing in custom encoders for nested types (I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)).

    It seems so: even if I add an encoder for just MyObj:

    implicit val myEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
    

    it still fails:

    java.lang.UnsupportedOperationException: No Encoder found for MyObj
    - field (class: "MyObj", name: "unwrap")
    - root class: "Wrap"
      at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
    

    But notice the important error message:

    root class: "Wrap"

    It actually hints that an encoder for MyObj alone isn't enough: the encoder has to cover the entire chain, including the outer Wrap[T].

    So if I instead encode the whole wrapper type, it solves the problem:

    implicit val myWrapperEncoder = org.apache.spark.sql.Encoders.kryo[Wrap[MyObj]]
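
    For completeness, here is a minimal self-contained sketch of the working version. It is only a sketch: the SparkSession setup and the final map/collect check are my own additions for illustration, not part of the original answer.

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

    case class Wrap[T](unwrap: T)
    class MyObj(val i: Int)

    val spark = SparkSession.builder().master("local[*]").appName("wrap-encoder").getOrCreate()
    import spark.implicits._

    // Register the kryo encoder for the *outer* type Wrap[MyObj]. Being more
    // specific than the generic product encoder from spark.implicits._, this
    // val wins implicit resolution in createDataset.
    implicit val myWrapperEncoder: Encoder[Wrap[MyObj]] = Encoders.kryo[Wrap[MyObj]]

    val d = spark.createDataset(Seq(Wrap(new MyObj(1)), Wrap(new MyObj(2)), Wrap(new MyObj(3))))

    // The values survive the round trip, although the Dataset stores each one
    // as a single opaque binary column.
    d.map(_.unwrap.i).collect().foreach(println)   // prints 1, 2, 3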
    

    Hence, @Alec's comment is not entirely true:

    I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)

    We still have a way to feed Spark an encoder so that it knows how to encode Wrap[MyObj] or (Int,MyObj): define the encoder for the full outer type rather than for MyObj alone.
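
    The same trick covers the (Int,MyObj) case from the quote: compose a kryo encoder for MyObj into an encoder for the whole pair with the standard Encoders.tuple combinator. A sketch, assuming spark and MyObj from the snippets above are in scope:

    import org.apache.spark.sql.{Encoder, Encoders}

    // One encoder for the whole tuple: a plain Int encoder for the first
    // column and a kryo encoder for the MyObj column.
    implicit val tupleEncoder: Encoder[(Int, MyObj)] =
      Encoders.tuple(Encoders.scalaInt, Encoders.kryo[MyObj])

    val d2 = spark.createDataset(Seq((1, new MyObj(1)), (2, new MyObj(2))))
    d2.map(_._1).collect().foreach(println)   // the Int column stays usable: prints 1, 2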
