I have a CSV file with 10 columns, half String and half Integer. What is the Scala code to save the DataFrame's schema to a file and load it back later?
The DataType API provides all the required utilities, so JSON is a natural choice:
import org.apache.spark.sql.types._
import scala.util.Try

// toDF requires the session implicits (already in scope in spark-shell):
// import spark.implicits._
val df = Seq((1L, "foo", 3.0)).toDF("id", "x1", "x2")

// Serialize the schema to its JSON representation
val serializedSchema: String = df.schema.json

// Deserialize, returning None on malformed input or a non-struct type
def loadSchema(s: String): Option[StructType] =
  Try(DataType.fromJson(s)).toOption.flatMap {
    case s: StructType => Some(s)
    case _ => None
  }

loadSchema(serializedSchema)
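Once recovered, the schema is typically passed straight back to a reader. A minimal usage sketch, assuming a Spark 2.x SparkSession named spark and a placeholder csvPath (both are assumptions, not part of the answer above):

// Apply the stored schema instead of re-inferring it on read
val csvPath: String = ???
val csvDf = loadSchema(serializedSchema).map { schema =>
  spark.read
    .schema(schema)            // use the recovered schema, skip inference
    .option("header", "true")  // assumption: the CSV has a header row
    .csv(csvPath)
}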
Depending on your requirements, you can use standard Scala methods to write this to a file (a java.nio sketch follows the RDD version below), or hack it with Spark RDDs:
val schemaPath: String = ???

sc.parallelize(Seq(serializedSchema), 1).saveAsTextFile(schemaPath)

val loadedSchema: Option[StructType] = sc.textFile(schemaPath)
  .map(loadSchema)               // Load
  .collect.headOption.flatten    // Make sure we don't fail if there is no data
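For the "standard Scala methods" route mentioned above, here is a driver-local sketch using java.nio (the /tmp/schema.json location is just an assumption):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Write the JSON schema to a local file on the driver
val localPath = Paths.get("/tmp/schema.json")  // assumed location
Files.write(localPath, serializedSchema.getBytes(StandardCharsets.UTF_8))

// Read it back and deserialize with the same loadSchema helper
val loadedLocally: Option[StructType] =
  loadSchema(new String(Files.readAllBytes(localPath), StandardCharsets.UTF_8))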
For a Python equivalent, see Config file to define JSON Schema Structure in PySpark.