Spark default null columns DataSet

一整个雨季 2021-02-06 19:19

I cannot make Spark read a JSON (or CSV, for that matter) as a Dataset of a case class with Option[_] fields when not all of the fields are present in the source file.
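
For context, here is a minimal sketch of the kind of setup involved. The case class, field names, and file contents are assumptions inferred from the answers below; the point is that .as[...] fails when the inferred schema has no column for one of the Option fields:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Assumed case class; the field names match the output shown in the second answer.
    case class CustomData(id: Long, colA: Option[String], colB: Option[String])

    // customB.json contains rows like {"id": 321, "colA": "x"} -- colB is never present.
    // Schema inference therefore produces no colB column, and the conversion below
    // fails with an AnalysisException ("cannot resolve 'colB'"):
    val ds = spark.read.json("src/main/resources/customB.json").as[CustomData]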

2 answers
  • 2021-02-06 19:21

    Here is an even simpler solution:

    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.catalyst.ScalaReflection

    // Derive the full schema from the case class and pass it explicitly, so Spark
    // never has to infer it: fields missing from the input simply become null columns.
    val structSchema = ScalaReflection.schemaFor[CustomData].dataType.asInstanceOf[StructType]
    val df = spark.read.schema(structSchema).json(jsonRDD)
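
    With the schema supplied up front, fields that are absent from the JSON come back as null columns, so the Dataset conversion works without any post-processing. A usage sketch, assuming the CustomData case class and the customB.json path used in the other answer, and that spark.implicits._ is in scope for the encoder:

    // Hypothetical usage of the explicit-schema read from above.
    val typed = spark.read
      .schema(structSchema)
      .json("src/main/resources/customB.json")
      .as[CustomData]
    typed.show()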
    
  • 2021-02-06 19:39

    I'll put an answer down here to show you what (sort of) works, but it looks very hacky IMHO.

    By extending the DataFrame with a method that forces the StructType of a case class on top of the already existing StructType, it actually works, but maybe (I really hope) there are better / cleaner solutions.

    Here goes:

    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.catalyst.ScalaReflection
    import scala.reflect.runtime.universe._
    
    case class DataFrameExtended(dataFrame: DataFrame) {
    
      // For every field of the case class that is missing from the DataFrame,
      // add a null column cast to that field's type.
      def forceMergeSchema[T: TypeTag]: DataFrame = {
        ScalaReflection
          .schemaFor[T]
          .dataType
          .asInstanceOf[StructType]
          .filterNot(
            field => dataFrame.columns.contains(field.name)
          )
          .foldLeft(dataFrame){
            case (newDf, field) => newDf.withColumn(field.name, lit(null).cast(field.dataType))
          }
      }
    }
    
    // Implicit conversion so forceMergeSchema can be called directly on a DataFrame.
    implicit def dataFrameExtended(df: DataFrame): DataFrameExtended = {
      DataFrameExtended(df)
    }

    // Needed for the .as[CustomData] encoder.
    import spark.implicits._

    val ds2 = spark
      .read
      .option("mode", "PERMISSIVE")
      .json("src/main/resources/customB.json")
      .forceMergeSchema[CustomData]
      .as[CustomData]

    ds2.show()
    

    And it shows the result I was hoping for:

    +----+---+----+
    |colA| id|colB|
    +----+---+----+
    |   x|321|null|
    |   y|654|null|
    |null|987|null|
    +----+---+----+
    

    I've tried this only with scalar types (Int, String, etc.); I think more complex structures will fail horribly, so I'm still looking for a better answer.
