I have created a PySpark application that reads a JSON file into a dataframe through a defined schema. Code sample below:

schema = StructType([
    StructField("first_fields", StringType(), True),
    StructField("double_field", DoubleType(), True),
])
You can create a JSON file named schema.json in the format below:
{
  "fields": [
    {
      "metadata": {},
      "name": "first_fields",
      "nullable": true,
      "type": "string"
    },
    {
      "metadata": {},
      "name": "double_field",
      "nullable": true,
      "type": "double"
    }
  ],
  "type": "struct"
}
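If you already have a StructType defined in code, one way to produce this file is to serialize the schema with jsonValue and the built-in json module. A minimal sketch, assuming you want to write schema.json locally; the field names mirror the example above:

import json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# define the schema in code once, then persist it as schema.json
schema = StructType([
    StructField("first_fields", StringType(), True),
    StructField("double_field", DoubleType(), True),
])

# jsonValue() returns a plain dict; json.dumps writes it in the format shown above
with open("schema.json", "w") as f:
    f.write(json.dumps(schema.jsonValue(), indent=2))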
Then create a StructType schema by reading this file:
import json
from pyspark.sql.types import StructType

# wholeTextFiles returns (path, content) pairs; take the file content
rdd = spark.sparkContext.wholeTextFiles("s3://<bucket>/schema.json")
text = rdd.collect()[0][1]
# parse the JSON into a dict (avoid shadowing the built-in dict) and build the StructType
schema_dict = json.loads(text)
custom_schema = StructType.fromJson(schema_dict)
After that, you can use this struct as the schema to read the JSON file:
df = spark.read.json("path", schema=custom_schema)
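Equivalently, you can pass the schema through the reader's schema method; a short sketch, with "path" as a placeholder:

# same result, using the DataFrameReader.schema builder
df = spark.read.schema(custom_schema).json("path")
df.printSchema()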
StructType provides json and jsonValue methods, which can be used to obtain json and dict representations respectively, and fromJson, which can be used to convert a Python dictionary to a StructType.
from pyspark.sql.types import LongType, StringType, StructField, StructType

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])

# round-trip: dict representation back to a StructType
StructType.fromJson(schema.jsonValue())
The only thing you need beyond that is the built-in json module to parse the input into a dict that can be consumed by StructType.fromJson.
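Putting that together, a minimal sketch of reloading a schema from a local JSON file with the json module; the file name is just a placeholder:

import json
from pyspark.sql.types import StructType

# parse the saved JSON into a dict, then rebuild the StructType
with open("schema.json") as f:
    loaded_schema = StructType.fromJson(json.load(f))

print(loaded_schema == schema)  # True if the round trip preserved the schema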
For the Scala version, see How to create a schema from CSV file and persist/save that schema to a file?