The use case is to read a file and create a dataframe on top of it. After that, get the schema of that file and store it into a DB table.
For example purposes I am just creating a sample case class.
Try this -
//-- For local file
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("Name", StringType, true),
  StructField("Age", IntegerType, true),
  StructField("Designation", StringType, true),
  StructField("Salary", IntegerType, true),
  StructField("ZipCode", IntegerType, true)))

// "multiLine" (not "wholeFile") is the CSV option for records spanning lines;
// applying the schema at read time gives the Rows the declared column types
val rdd = spark.read.option("multiLine", true).option("delimiter", ",")
  .schema(schema).csv("file:///file/path/file.csv").rdd
val df = spark.createDataFrame(rdd, schema)
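As a quick sanity check (not part of the original answer), printing the schema of the new dataframe should list the declared fields:

df.printSchema()
// root
//  |-- Name: string (nullable = true)
//  |-- Age: integer (nullable = true)
//  |-- Designation: string (nullable = true)
//  |-- Salary: integer (nullable = true)
//  |-- ZipCode: integer (nullable = true)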
Spark >= 2.4.0
To save the schema in a string format you can use the toDDL method of StructType. In your case the DDL string should be:

`Name` STRING, `Age` INT, `Designation` STRING, `Salary` INT, `ZipCode` INT

After saving the schema you can load it from the database and parse it back with StructType.fromDDL(my_schema). This returns an instance of StructType which you can use to create a new dataframe with spark.createDataFrame, as @Ajay already mentioned.
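Here is a minimal sketch of the full round trip, reusing df from the first answer; the JDBC URL, table name, and credentials are placeholder assumptions, not something the original answers specify:

import org.apache.spark.sql.types.StructType
import spark.implicits._

// 1. Serialize the schema to its DDL string (Spark >= 2.4)
val ddl: String = df.schema.toDDL

// 2. Store it in a DB table (hypothetical url/table/credentials)
Seq(("file.csv", ddl)).toDF("file_name", "schema_ddl")
  .write.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")
  .option("dbtable", "file_schemas")
  .option("user", "user").option("password", "pass")
  .mode("append")
  .save()

// 3. Later: read the string back from the DB (read path omitted, reusing
//    ddl here for brevity) and rebuild the StructType
val restored: StructType = StructType.fromDDL(ddl)
val df2 = spark.createDataFrame(df.rdd, restored)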
It is also useful to remember that you can always extract the schema from a case class with:

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Employee matches the fields used in the first answer
case class Employee(Name: String, Age: Int, Designation: String, Salary: Int, ZipCode: Int)

val empSchema = ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType]
You can then get the DDL representation with empSchema.toDDL.
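As a short usage sketch, the reflected schema can also be passed straight to the CSV reader in place of the hand-written one from the first answer:

val dfEmp = spark.read.option("delimiter", ",")
  .schema(empSchema)
  .csv("file:///file/path/file.csv")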
Spark < 2.4
For Spark < 2.4, use DataType.fromDDL and schema.simpleString respectively. Also, instead of a StructType you should work with a DataType instance, omitting the cast, as follows:
val empSchema = ScalaReflection.schemaFor[Employee].dataType
Sample output for empSchema.simpleString:
struct<Name:string,Age:int,Designation:string,Salary:int,ZipCode:int>
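And the matching round trip for Spark < 2.4, a sketch that assumes the simpleString form above is what was stored; note that createDataFrame still needs a StructType, so the cast moves to the point of use:

import org.apache.spark.sql.types.{DataType, StructType}

// Persist the simpleString form, then parse it back; fromDDL returns a DataType
val stored: String = empSchema.simpleString
val parsed: DataType = DataType.fromDDL(stored)

// createDataFrame requires a StructType, so cast where the dataframe is built
val df3 = spark.createDataFrame(df.rdd, parsed.asInstanceOf[StructType])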