How to store custom objects in Dataset?

Asked 2020-11-22 01:53

According to Introducing Spark Datasets:

As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom

9 Answers
  • 2020-11-22 02:11

    My examples will be in Java, but I don't imagine it would be difficult to adapt them to Scala.

    I have been quite successful converting RDD<Fruit> to Dataset<Fruit> using spark.createDataset and Encoders.bean as long as Fruit is a simple Java Bean.

    Step 1: Create the simple Java Bean.

    public class Fruit implements Serializable {
        private String name  = "default-fruit";
        private String color = "default-color";
    
        // AllArgsConstructor
        public Fruit(String name, String color) {
            this.name  = name;
            this.color = color;
        }
    
        // NoArgsConstructor
        public Fruit() {
            this("default-fruit", "default-color");
        }
    
        // ...create getters and setters for above fields
        // you figure it out
    }
    

    I'd stick to classes with primitive types and String as fields until the Databricks folks beef up their Encoders. If you have a class with a nested object, create another simple Java Bean with all of its fields flattened, so you can use RDD transformations to map the complex type to the simpler one (a rough sketch of this follows below). Sure, it's a little extra work, but I imagine it'll help a lot with performance when working with a flat schema.
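
    For example, the flattening might look roughly like this (a sketch, not from the original answer; NestedFruit, FlatFruit, nestedRDD and the weightKg field are hypothetical names):

    // Hypothetical nested type that Encoders.bean handles poorly
    public class NestedFruit implements Serializable {
        private Fruit fruit;      // nested object
        private double weightKg;
        // ...no-arg constructor, getters and setters, as with Fruit above
    }
    
    // Flattened bean: every field is a primitive or a String
    public class FlatFruit implements Serializable {
        private String name;
        private String color;
        private double weightKg;
        // ...no-arg constructor, getters and setters
    }
    
    // Map the complex type to the flat one before creating the Dataset.
    // nestedRDD is a JavaRDD<NestedFruit> built elsewhere (e.g. via jsc.parallelize),
    // and spark is the SparkSession created in Step 2 below.
    JavaRDD<FlatFruit> flatRDD = nestedRDD.map(nf -> {
        FlatFruit ff = new FlatFruit();
        ff.setName(nf.getFruit().getName());
        ff.setColor(nf.getFruit().getColor());
        ff.setWeightKg(nf.getWeightKg());
        return ff;
    });
    Dataset<FlatFruit> flatDataset =
        spark.createDataset(flatRDD.rdd(), Encoders.bean(FlatFruit.class));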

    Step 2: Get your Dataset from the RDD

    SparkSession spark = SparkSession.builder().getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
    
    List<Fruit> fruitList = ImmutableList.of(
        new Fruit("apple", "red"),
        new Fruit("orange", "orange"),
        new Fruit("grape", "purple"));
    JavaRDD<Fruit> fruitJavaRDD = jsc.parallelize(fruitList);
    
    
    RDD<Fruit> fruitRDD = fruitJavaRDD.rdd();
    Encoder<Fruit> fruitBean = Encoders.bean(Fruit.class);
    Dataset<Fruit> fruitDataset = spark.createDataset(fruitRDD, fruitBean);
    

    And voila! Lather, rinse, repeat.

  • 2020-11-22 02:14

    Encoders work more or less the same in Spark 2.0, and Kryo is still the recommended serialization choice.

    You can look at the following example with spark-shell:

    scala> import spark.implicits._
    import spark.implicits._
    
    scala> import org.apache.spark.sql.Encoders
    import org.apache.spark.sql.Encoders
    
    scala> case class NormalPerson(name: String, age: Int) {
     |   def aboutMe = s"I am ${name}. I am ${age} years old."
     | }
    defined class NormalPerson
    
    scala> case class ReversePerson(name: Int, age: String) {
     |   def aboutMe = s"I am ${name}. I am ${age} years old."
     | }
    defined class ReversePerson
    
    scala> val normalPersons = Seq(
     |   NormalPerson("Superman", 25),
     |   NormalPerson("Spiderman", 17),
     |   NormalPerson("Ironman", 29)
     | )
    normalPersons: Seq[NormalPerson] = List(NormalPerson(Superman,25), NormalPerson(Spiderman,17), NormalPerson(Ironman,29))
    
    scala> val ds1 = sc.parallelize(normalPersons).toDS
    ds1: org.apache.spark.sql.Dataset[NormalPerson] = [name: string, age: int]
    
    scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
    ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
    
    scala> ds1.show()
    +---------+---+
    |     name|age|
    +---------+---+
    | Superman| 25|
    |Spiderman| 17|
    |  Ironman| 29|
    +---------+---+
    
    scala> ds2.show()
    +----+---------+
    |name|      age|
    +----+---------+
    |  25| Superman|
    |  17|Spiderman|
    |  29|  Ironman|
    +----+---------+
    
    scala> ds1.foreach(p => println(p.aboutMe))
    I am Ironman. I am 29 years old.
    I am Superman. I am 25 years old.
    I am Spiderman. I am 17 years old.
    
    scala> ds2.foreach(p => println(p.aboutMe))
    I am 17. I am Spiderman years old.
    I am 25. I am Superman years old.
    I am 29. I am Ironman years old.
    

    Till now there were no appropriate encoders in the present scope, so our persons were not encoded as binary values. But that will change once we provide some implicit encoders using Kryo serialization.

    // Provide Encoders
    
    scala> implicit val normalPersonKryoEncoder = Encoders.kryo[NormalPerson]
    normalPersonKryoEncoder: org.apache.spark.sql.Encoder[NormalPerson] = class[value[0]: binary]
    
    scala> implicit val reversePersonKryoEncoder = Encoders.kryo[ReversePerson]
    reversePersonKryoEncoder: org.apache.spark.sql.Encoder[ReversePerson] = class[value[0]: binary]
    
    // Encoders will be used since they are now present in scope
    
    scala> val ds3 = sc.parallelize(normalPersons).toDS
    ds3: org.apache.spark.sql.Dataset[NormalPerson] = [value: binary]
    
    scala> val ds4 = ds3.map(np => ReversePerson(np.age, np.name))
    ds4: org.apache.spark.sql.Dataset[ReversePerson] = [value: binary]
    
    // now all our persons show up as binary values
    scala> ds3.show()
    +--------------------+
    |               value|
    +--------------------+
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    +--------------------+
    
    scala> ds4.show()
    +--------------------+
    |               value|
    +--------------------+
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    +--------------------+
    
    // Our instances still work as expected    
    
    scala> ds3.foreach(p => println(p.aboutMe))
    I am Ironman. I am 29 years old.
    I am Spiderman. I am 17 years old.
    I am Superman. I am 25 years old.
    
    scala> ds4.foreach(p => println(p.aboutMe))
    I am 25. I am Superman years old.
    I am 29. I am Ironman years old.
    I am 17. I am Spiderman years old.
    
  • 2020-11-22 02:17

    @Alec's answer is great! Just to add a comment on this part of their answer:

    import spark.implicits._
    case class Wrap[T](unwrap: T)
    class MyObj(val i: Int)
    // ...
    val d = spark.createDataset(Seq(Wrap(new MyObj(1)),Wrap(new MyObj(2)),Wrap(new MyObj(3))))
    

    @Alec mentions:

    there is no way of passing in custom encoders for nested types (I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)).

    It seems so, because if I add an encoder for MyObj:

    implicit val myEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
    

    it still fails:

    java.lang.UnsupportedOperationException: No Encoder found for MyObj
    - field (class: "MyObj", name: "unwrap")
    - root class: "Wrap"
      at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
    

    But notice the important error message:

    root class: "Wrap"

    It actually gives a hint that encoding MyObj isn't enough, and you have to encode the entire chain including Wrap[T].

    So if I do this, it solves the problem:

    implicit val myWrapperEncoder = org.apache.spark.sql.Encoders.kryo[Wrap[MyObj]]
    

    Hence, @Alec's comment is not entirely true:

    I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)

    We still have a way to feed Spark an encoder (here, one for the whole Wrap[MyObj]) so that it then knows how to encode Wrap[MyObj] or (Int, MyObj).

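    For Java users (as in the first answer above), a minimal sketch of the same "encode the whole chain" idea might look roughly like this; the Java MyObj and Wrap classes are hypothetical stand-ins for the Scala ones, not part of the original answer:

    // imports assumed: java.io.Serializable, java.util.Arrays, org.apache.spark.sql.*
    public class MyObj implements Serializable {
        public int i;
        public MyObj() {}
        public MyObj(int i) { this.i = i; }
    }
    
    public class Wrap<T> implements Serializable {
        public T unwrap;
        public Wrap() {}
        public Wrap(T unwrap) { this.unwrap = unwrap; }
    }
    
    SparkSession spark = SparkSession.builder().getOrCreate();
    
    // Kryo encoder for the whole Wrap<MyObj> chain, not just MyObj
    // (the double cast is the usual trick for getting a generic Class token)
    Encoder<Wrap<MyObj>> wrapEncoder =
        Encoders.kryo((Class<Wrap<MyObj>>) (Class<?>) Wrap.class);
    
    Dataset<Wrap<MyObj>> ds = spark.createDataset(
        Arrays.asList(new Wrap<>(new MyObj(1)), new Wrap<>(new MyObj(2))),
        wrapEncoder);
    
    ds.show();  // a single binary "value" column, just like the Scala Kryo examples above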