How to create a custom Encoder in Spark 2.X Datasets?

筅森魡賤 提交于 2019-12-31 12:21:05

问题


Spark Datasets move away from Row's to Encoder's for Pojo's/primitives. The Catalyst engine uses an ExpressionEncoder to convert columns in a SQL expression. However there do not appear to be other subclasses of Encoder available to use as a template for our own implementations.

Here is an example of code that is happy in Spark 1.X / DataFrames that does not compile in the new regime:

//mapping each row to RDD tuple
df.map(row => {
    var id: String = if (!has_id) "" else row.getAs[String]("id")
    var label: String = row.getAs[String]("label")
    val channels  : Int = if (!has_channels) 0 else row.getAs[Int]("channels")
    val height  : Int = if (!has_height) 0 else row.getAs[Int]("height")
    val width : Int = if (!has_width) 0 else row.getAs[Int]("width")
    val data : Array[Byte] = row.getAs[Any]("data") match {
      case str: String => str.getBytes
      case arr: Array[Byte@unchecked] => arr
      case _ => {
        log.error("Unsupport value type")
        null
      }
    }
    (id, label, channels, height, width, data)
  }).persist(StorageLevel.DISK_ONLY)

}

We get a compiler error of

Error:(56, 11) Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are supported 
by importing spark.implicits._  Support for serializing other types will be added in future releases.
    df.map(row => {
          ^

So then somehow/somewhere there should be a means to

  • Define/implement our custom Encoder
  • Apply it when performing a mapping on the DataFrame (which is now a Dataset of type Row)
  • Register the Encoder for use by other custom code

I am looking for code that successfully performs these steps.


回答1:


As far as I am aware nothing really changed since 1.6 and the solutions described in How to store custom objects in Dataset? are the only available options. Nevertheless your current code should work just fine with default encoders for product types.

To get some insight why your code worked in 1.x and may not work in 2.0.0 you'll have to check the signatures. In 1.x DataFrame.map is a method which takes function Row => T and transforms RDD[Row] into RDD[T].

In 2.0.0 DataFrame.map takes a function of type Row => T as well, but transforms Dataset[Row] (a.k.a DataFrame) into Dataset[T] hence T requires an Encoder. If you want to get the "old" behavior you should use RDD explicitly:

df.rdd.map(row => ???)

For Dataset[Row] map see Encoder error while trying to map dataframe row to updated row




回答2:


Did you import the implicit encoders?

import spark.implicits._

http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.Encoder




回答3:


I imported spark.implicits._ Where spark is the SparkSession and it solved the error and custom encoders got imported.

Also, writing a custom encoder is a way out which I've not tried.

Working solution:- Create SparkSession and import the following

import spark.implicits._



来源:https://stackoverflow.com/questions/37706420/how-to-create-a-custom-encoder-in-spark-2-x-datasets

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!