How to create a schema from CSV file and persist/save that schema to a file?

傲寒 2021-01-14 14:10

I have a CSV file with 10 columns: half are Strings and half are Integers.

What is the Scala code to:

  • Create (infer) the schema
  • Save that schema to
1 answer
  •  攒了一身酷
    2021-01-14 14:39

    The DataType API provides all the required utilities, so JSON is a natural choice:

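    import org.apache.spark.sql.types._
    import scala.util.Try
    // Outside spark-shell, toDF also needs: import spark.implicits._
    
    val df = Seq((1L, "foo", 3.0)).toDF("id", "x1", "x2")
    // Serialize the schema to a JSON string
    val serializedSchema: String = df.schema.json
    
    // Parse the JSON back; returns None if it is malformed or not a struct
    def loadSchema(s: String): Option[StructType] =
      Try(DataType.fromJson(s)).toOption.flatMap {
        case st: StructType => Some(st)
        case _ => None
      }
    
    loadSchema(serializedSchema)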
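
The question starts from a CSV file, whereas the snippet above builds its DataFrame from a Seq. A minimal sketch of the CSV case follows, assuming Spark 2.x or later and a local SparkSession (in spark-shell, `spark` already exists). The temp file and its two columns (`name`, `age`) are made up for illustration and stand in for the real 10-column file.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

// Assumption: local mode; skip this line in spark-shell, where `spark` is predefined.
val spark = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()

// Hypothetical sample data standing in for the real CSV file.
val csvPath = Files.createTempFile("people", ".csv")
Files.write(csvPath, "name,age\nalice,30\nbob,25\n".getBytes(StandardCharsets.UTF_8))

// Let Spark infer the column types; this costs an extra pass over the data.
val inferred: StructType = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath.toString)
  .schema

// Round-trip the schema through its JSON representation.
val restored: StructType = DataType.fromJson(inferred.json) match {
  case st: StructType => st
  case other          => sys.error(s"Expected a struct type, got $other")
}

// Reading with an explicit schema avoids running inference again.
val df = spark.read.option("header", "true").schema(restored).csv(csvPath.toString)
```

Inference is only needed once; later jobs can load the stored JSON and pass the resulting StructType to `.schema(...)`.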
    

    Depending on your requirements, you can use standard Scala methods to write this to a file, or hack a Spark RDD:

    val schemaPath: String = ???
    
    sc.parallelize(Seq(serializedSchema), 1).saveAsTextFile(schemaPath)
    val loadedSchema: Option[StructType] = sc.textFile(schemaPath)
      .map(loadSchema)  // Load
      .collect.headOption.flatten  // Make sure we don't fail if there is no data
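
The "standard Scala methods" route mentioned above could look like the following sketch, which uses java.nio from the JDK and no Spark at all. The helper names `writeSchemaJson` and `readSchemaJson` are hypothetical, and the approach assumes the schema lives on a local filesystem rather than HDFS or S3.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.util.Try

// Write the schema JSON to a local file, replacing any existing content.
def writeSchemaJson(path: Path, json: String): Unit = {
  Files.write(path, json.getBytes(StandardCharsets.UTF_8))
  ()
}

// Read the JSON back; None if the file is missing or unreadable.
def readSchemaJson(path: Path): Option[String] =
  Try(new String(Files.readAllBytes(path), StandardCharsets.UTF_8)).toOption
```

Combined with the `loadSchema` function from the answer, `readSchemaJson(path).flatMap(loadSchema)` yields an `Option[StructType]` without launching a Spark job just to read one small file.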
    

    For a Python equivalent, see "Config file to define JSON Schema Struture in PySpark".
