Spark default null columns DataSet

一整个雨季 2021-02-06 19:19

I cannot make Spark read a JSON (or CSV, for that matter) as a Dataset of a case class with Option[_] fields when not all of those fields are present in the input.
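A minimal reproduction of the problem, assuming a hypothetical case class `CustomData` and input file `customB.json` (both names are illustrative, inferred from the answer below):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class: colA and colB are optional, so they may be
// absent from the input data.
case class CustomData(id: Long, colA: Option[String], colB: Option[String])

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Suppose customB.json contains lines like {"id": 321, "colA": "x"},
// with no colB anywhere. Spark infers a schema without colB, so the
// conversion below fails with an AnalysisException (it cannot resolve
// the missing column), even though the field is an Option.
val ds = spark.read.json("src/main/resources/customB.json").as[CustomData]
```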

2 Answers
  •  情话喂你
    2021-02-06 19:39

    I'll put an answer down here to show what (sort of) works, though it looks very hacky IMHO.

    By extending the DataFrame with a method that forces the StructType of a case class on top of the already existing StructType, it actually works, but maybe (I really hope) there are better / cleaner solutions.

    Here goes:

    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.catalyst.ScalaReflection
    import scala.reflect.runtime.universe._
    
    case class DataFrameExtended(dataFrame: DataFrame) {
    
      // Derive the StructType of the case class T via reflection, keep only
      // the fields that are missing from the DataFrame, and add each of them
      // as a null column cast to the expected type.
      def forceMergeSchema[T: TypeTag]: DataFrame = {
        ScalaReflection
          .schemaFor[T]
          .dataType
          .asInstanceOf[StructType]
          .filterNot(
            field => dataFrame.columns.contains(field.name)
          )
          .foldLeft(dataFrame){
            case (newDf, field) => newDf.withColumn(field.name, lit(null).cast(field.dataType))
          }
      }
    }
    
    // Implicit conversion so forceMergeSchema can be called directly on a DataFrame.
    implicit def dataFrameExtended(df: DataFrame): DataFrameExtended = {
      DataFrameExtended(df)
    }
    
    // Note: requires import spark.implicits._ and the CustomData case class in scope.
    val ds2 = spark
      .read
      .option("mode", "PERMISSIVE")
      .json("src/main/resources/customB.json")
      .forceMergeSchema[CustomData]
      .as[CustomData]
    ds2.show()
    

    This now shows the result I was hoping for:

    +----+---+----+
    |colA| id|colB|
    +----+---+----+
    |   x|321|null|
    |   y|654|null|
    |null|987|null|
    +----+---+----+
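
    For reference, the input file I assume this was run against looks like the following (JSON Lines, with colB absent from every record and colA absent from the last one; the contents are inferred from the output above, not shown in the original):

```json
{"id": 321, "colA": "x"}
{"id": 654, "colA": "y"}
{"id": 987}
```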
    

    I've tried this only with scalar types (Int, String, etc.); I think more complex structures will fail horribly, so I'm still looking for a better answer.
