PySpark Schema for JSON File

Submitted on 2021-02-19 08:14:06

Question


I am trying to read a complex JSON file into a Spark DataFrame. Spark infers the schema but mistakes one field, which happens to be an empty array, for a string. (I am not sure why it is inferred as string type when it should be an array type.) Below is a sample of what I am expecting:

arrayfield:[{"name":"somename"},{"address" : "someadress"}]

Right now the data looks like this:

arrayfield:[]

What this does to my code is that whenever I try querying arrayfield.name, it fails. I know I can supply a schema while reading the file, but since the JSON structure is really complex, writing it from scratch doesn't really work out. I tried getting the schema using df.schema (which is displayed as a StructType) and modifying it to my requirements, but how do I turn that string back into a StructType? This might be really silly, but I am finding it hard to fix. Is there any tool or utility that would help me generate the StructType?


Answer 1:


You need to pass a StructType object to the DataFrame reader.

Let's say your DataFrame with the mistaken schema, after executing

df.schema

prints output like this:

StructType(List(StructField(data1,StringType,true),StructField(data2,StringType,true)))

so you need to translate this string into executable code.

  1. Add an import for types

    from pyspark.sql.types import *
    
  2. Change List and its parentheses to Python's brackets

    List() -> []
    
  3. After each type name, add parentheses

    StringType -> StringType()
    
  4. Fix the boolean value strings

    true -> True
    
  5. Assign it to a variable

    schema = StructType([
            StructField("data1", StringType(),True),
            StructField("data2", StringType(),True)])
    
  6. Create a new DataFrame object

    spark.read.json(path, schema=schema)
    

And you are done.
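The manual steps above can be automated for simple, flat schemas with a small string-rewriting helper. This is a rough sketch: it only handles flat schemas of atomic types like the printed output above, and a nested schema (such as the array field in the question) would need a real parser:

```python
import re

def schema_str_to_code(schema_str):
    """Turn Spark's printed schema string into runnable PySpark code.

    Rough sketch covering only flat schemas of atomic types, e.g.
    StructType(List(StructField(data1,StringType,true), ...)).
    """
    code = schema_str.replace("List(", "[")   # step 2: List( -> [
    code = code[:-2] + "])"                   # close the bracket that replaced List(
    # quote the bare field names: StructField(data1, -> StructField("data1",
    code = re.sub(r'StructField\((\w+),', r'StructField("\1",', code)
    # step 3: add parentheses after atomic type names
    code = re.sub(r'\b(String|Integer|Long|Double|Float|Boolean|Date|Timestamp)Type\b',
                  r'\1Type()', code)
    # step 4: fix the booleans
    code = code.replace("true", "True").replace("false", "False")
    return code

printed = ("StructType(List(StructField(data1,StringType,true),"
           "StructField(data2,StringType,true)))")
print(schema_str_to_code(printed))
```

The returned string can be pasted into your script (or evaluated after `from pyspark.sql.types import *`) and passed as the `schema` argument.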



Source: https://stackoverflow.com/questions/44585520/pyspark-schema-for-json-file
