Question
I am trying to read a complex JSON file into a Spark DataFrame. Spark recognizes the schema but mistypes a field as a string when it happens to be an empty array. (I'm not sure why it is string type when it has to be an array type.) Below is a sample of what I am expecting:
arrayfield:[{"name":"somename"},{"address" : "someadress"}]
Right now the data is as below:
arrayfield:[]
What this does to my code is that whenever I try querying arrayfield.name, it fails. I know I can supply a schema while reading the file, but since the JSON structure is really complex, writing it from scratch doesn't really work out. I tried getting the schema using df.schema (which displays a StructType) and modifying it to my requirements, but how do I pass that string back as a StructType? This might be really silly, but I am finding it hard to fix. Is there any tool or utility that would help me generate the StructType?
Answer 1:
You need to pass a StructType object when creating the DataFrame.
Let's say that for your DataFrame with the wrong types, executing
df.schema
prints output like this:
StructType(List(StructField(data1,StringType,true),StructField(data2,StringType,true)))
So you need to translate this printed string into executable code.
Add an import for the types:
from pyspark.sql.types import *
Change List and its parentheses to Python's square brackets:
List() -> []
After each type declaration, add parentheses:
StringType -> StringType()
Fix the boolean value strings:
true -> True
Assign the result to a variable:
schema = StructType([StructField("data1", StringType(), True), StructField("data2", StringType(), True)])
Create a new DataFrame object (using the JSON reader, since the source file is JSON):
spark.read.json(path, schema=schema)
And you are done.
Source: https://stackoverflow.com/questions/44585520/pyspark-schema-for-json-file