I cannot make Spark read a JSON (or CSV, for that matter) as a Dataset of a case class with Option[_] fields when not all of those fields are present in the source.
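Roughly, assuming a case class along these lines (a sketch; the exact CustomData definition here is a stand-in for mine) and an input JSON that simply has no colB field, the straightforward read fails:

import spark.implicits._

case class CustomData(id: Long, colA: Option[String], colB: Option[String])

// Schema inference only sees the columns that are actually in the file, so colB
// is absent and the conversion to the case class fails with an AnalysisException.
val ds = spark
  .read
  .json("src/main/resources/customB.json")
  .as[CustomData]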
Here is an even simpler solution:
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.ScalaReflection

// Derive the StructType of the case class via reflection.
val structSchema = ScalaReflection.schemaFor[CustomData].dataType.asInstanceOf[StructType]

// Reading with an explicit schema keeps the missing Option[_] fields as null columns
// instead of dropping them from the inferred schema.
val df = spark.read.schema(structSchema).json(jsonRDD)
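Going all the way to a Dataset, the same derived schema works with a file-based read as well. A sketch, reusing the hypothetical CustomData definition from the question and the customB.json path from the other answer (jsonRDD from the snippet above would behave the same way):

import spark.implicits._

// With an explicit schema, fields that are missing from the JSON come back as null,
// which the encoder maps to None for Option[_] fields.
val ds = spark
  .read
  .schema(structSchema)
  .json("src/main/resources/customB.json")
  .as[CustomData]

ds.show()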
I'll put an answer down here to show what (sort of) works, although it looks very hacky IMHO.
By extending the DataFrame with a method that forces the StructType of the case class on top of the already existing StructType, it actually works, but maybe (I really hope) there are better / cleaner solutions.
Here goes:
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.ScalaReflection
import scala.reflect.runtime.universe._

case class DataFrameExtended(dataFrame: DataFrame) {
  // Derive the StructType of T via reflection, keep only the fields that are
  // missing from the DataFrame, and add each of them as a null column cast to
  // the expected data type.
  def forceMergeSchema[T: TypeTag]: DataFrame = {
    ScalaReflection
      .schemaFor[T]
      .dataType
      .asInstanceOf[StructType]
      .filterNot(
        field => dataFrame.columns.contains(field.name)
      )
      .foldLeft(dataFrame) {
        case (newDf, field) => newDf.withColumn(field.name, lit(null).cast(field.dataType))
      }
  }
}

implicit def dataFrameExtended(df: DataFrame): DataFrameExtended = {
  DataFrameExtended(df)
}
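As a side note on the "hacky" part: the same extension can be written as an implicit class, which avoids the separate case class plus implicit conversion. A sketch (the name DataFrameSchemaOps is made up here; the method body is unchanged):

implicit class DataFrameSchemaOps(dataFrame: DataFrame) {
  def forceMergeSchema[T: TypeTag]: DataFrame =
    ScalaReflection
      .schemaFor[T]
      .dataType
      .asInstanceOf[StructType]
      .filterNot(field => dataFrame.columns.contains(field.name))
      .foldLeft(dataFrame) {
        // Add each missing field as a null column cast to its expected type.
        case (df, field) => df.withColumn(field.name, lit(null).cast(field.dataType))
      }
}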
val ds2 = spark
  .read
  .option("mode", "PERMISSIVE")
  .json("src/main/resources/customB.json")
  .forceMergeSchema[CustomData]
  .as[CustomData]

ds2.show()
This now shows the result I was hoping for:
+----+---+----+
|colA| id|colB|
+----+---+----+
| x|321|null|
| y|654|null|
|null|987|null|
+----+---+----+
I've tried this only with scalar types (Int, String, etc.); I suspect more complex, nested structures will fail horribly, so I'm still looking for a better answer.