Config file to define JSON Schema Structure in PySpark

Backend · Unresolved · 2 answers · 1457 views
盖世英雄少女心 asked 2020-12-07 01:55

I have created a PySpark application that reads a JSON file into a DataFrame through a defined schema. Code sample below:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])

Instead of hard-coding the schema like this, I would like to define it in a config file and load it at runtime. How can I do that?
2 Answers
  • 2020-12-07 02:13

    You can create a JSON file named schema.json in the following format:

    {
      "fields": [
        {
          "metadata": {},
          "name": "first_fields",
          "nullable": true,
          "type": "string"
        },
        {
          "metadata": {},
          "name": "double_field",
          "nullable": true,
          "type": "double"
        }
      ],
      "type": "struct"
    }
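
    You don't have to write this file by hand. If you already have a StructType in code, its json() method emits exactly this structure, so you can generate the file from it; a minimal sketch, assuming you want to persist the two fields above to a local schema.json:

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    schema = StructType([
        StructField("first_fields", StringType(), True),
        StructField("double_field", DoubleType(), True),
    ])

    # StructType.json() serializes the schema to the JSON structure shown above
    with open("schema.json", "w") as f:
        f.write(schema.json())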
    

    Then create a StructType schema by reading this file:

    import json
    from pyspark.sql.types import StructType

    # wholeTextFiles returns (path, content) pairs; grab the file content
    rdd = spark.sparkContext.wholeTextFiles("s3://<bucket>/schema.json")
    text = rdd.collect()[0][1]
    schema_dict = json.loads(text)  # renamed: "dict" would shadow the built-in
    custom_schema = StructType.fromJson(schema_dict)
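
    If the schema file lives on the local filesystem rather than S3, plain Python file I/O is enough; a sketch, assuming a local schema.json:

    import json
    from pyspark.sql.types import StructType

    # json.load parses the file straight into the dict that fromJson expects
    with open("schema.json") as f:
        custom_schema = StructType.fromJson(json.load(f))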
    

    After that, you can use the struct as a schema to read the JSON data file:

    df = spark.read.json("path", schema=custom_schema)
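
    You can verify that your schema was applied instead of being inferred; expected output for the two fields defined above:

    df.printSchema()
    # root
    #  |-- first_fields: string (nullable = true)
    #  |-- double_field: double (nullable = true)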
    
  • 2020-12-07 02:32

    StructType provides json and jsonValue methods, which return the JSON string and dict representation respectively, and a fromJson method, which converts a Python dictionary back to a StructType.

    from pyspark.sql.types import StructType, StructField, StringType, LongType

    schema = StructType([
        StructField("domain", StringType(), True),
        StructField("timestamp", LongType(), True),
    ])

    # Round trip: the dict representation converts back to an equal StructType
    StructType.fromJson(schema.jsonValue())
    

    The only thing you need beyond that is the built-in json module, which parses the input into a dict that fromJson can consume.
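
    For example, a minimal sketch of that round trip, reusing the schema defined above:

    import json

    # json() gives a JSON string; the built-in json module parses it back to a dict
    schema_dict = json.loads(schema.json())
    restored = StructType.fromJson(schema_dict)
    assert restored == schema  # equal field names, types, and nullability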

    For the Scala version, see How to create a schema from CSV file and persist/save that schema to a file?
