How to access sub-entities in a JSON file?

Asked by 闹比i on 2020-12-04 04:01

I have a JSON file that looks like this:

{
  "employeeDetails":{
    "name": "xxxx",
    "num":"415"
  },
  "work":[
    {
      "monthYear":"01/2007",
      "workdate":"1|2|3|....|31",
      "workhours":"8|8|8....|8"
    },
    {
      "monthYear":"02/2007",
      "workdate":"1|2|3|....|31",
      "workhours":"8|8|8....|8"
    }
  ]
}

How do I access the sub-entities in this file using Spark?

1 Answer
    Reading the file as-is and selecting any field fails with an error that ends in given input columns: [_corrupt_record];; because the only column Spark could produce is _corrupt_record.

    The reason is that Spark supports JSON files in which "Each line must contain a separate, self-contained valid JSON object."

    Quoting JSON Datasets:

    Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.
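
    To see the difference, a JSON Lines version of the same data would keep the whole record on a single line, e.g.:

    {"employeeDetails":{"name":"xxxx","num":"415"},"work":[{"monthYear":"01/2007","workdate":"1|2|3|....|31","workhours":"8|8|8....|8"},{"monthYear":"02/2007","workdate":"1|2|3|....|31","workhours":"8|8|8....|8"}]}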

    If a JSON file is incorrect for Spark, Spark stores the offending records under the _corrupt_record column (a name you can change with the columnNameOfCorruptRecord option).

    scala> spark.read.json("employee.json").printSchema
    root
     |-- _corrupt_record: string (nullable = true)
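
    For example, the corrupt-record column can be renamed on read (a sketch; the name _bad_record is an arbitrary choice, not anything Spark requires):

    scala> spark.read.option("columnNameOfCorruptRecord", "_bad_record").json("employee.json").printSchema
    root
     |-- _bad_record: string (nullable = true)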
    

    And your file is incorrect not only because it's multi-line JSON; jq (a lightweight and flexible command-line JSON processor) also rejects it:

    $ cat incorrect.json
    {
      "employeeDetails":{
        "name": "xxxx",
        "num:"415"
      }
      "work":[
      {
        "monthYear":"01/2007"
        "workdate":"1|2|3|....|31",
        "workhours":"8|8|8....|8"
      },
      {
        "monthYear":"02/2007"
        "workdate":"1|2|3|....|31",
        "workhours":"8|8|8....|8"
      }
      ],
    }
    $ cat incorrect.json | jq
    parse error: Expected separator between values at line 4, column 14
    

    Once you fix the JSON file (restore the missing quote after num, add the missing commas after the employeeDetails object and after the monthYear values, and drop the trailing comma before the closing brace), use the following trick to load the multi-line JSON file. sc.wholeTextFiles reads each file as a single (path, content) pair, so the entire document is handed to the JSON parser as one value.

    scala> spark.version
    res5: String = 2.1.1
    
    scala> val employees = spark.read.json(sc.wholeTextFiles("employee.json").values)
    scala> employees.printSchema
    root
     |-- employeeDetails: struct (nullable = true)
     |    |-- name: string (nullable = true)
     |    |-- num: string (nullable = true)
     |-- work: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- monthYear: string (nullable = true)
     |    |    |-- workdate: string (nullable = true)
     |    |    |-- workhours: string (nullable = true)
    
    scala> employees.select("employeeDetails").show()
    +---------------+
    |employeeDetails|
    +---------------+
    |     [xxxx,415]|
    +---------------+
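
    With the file loaded, the sub-entities from the question are reachable with dot notation into the employeeDetails struct, and explode flattens the work array. A sketch (it relies on the $-column implicits that spark-shell pre-imports):

    import org.apache.spark.sql.functions.explode

    // One output row per element of the work array.
    employees
      .select($"employeeDetails.name", $"employeeDetails.num", explode($"work").as("w"))
      .select($"name", $"num", $"w.monthYear", $"w.workdate", $"w.workhours")
      .show()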
    

    Spark >= 2.2

    As of Spark 2.2 you should use the multiLine option instead. The option was added in SPARK-20980 (Rename the option wholeFile to multiLine for JSON and CSV).

    scala> spark.version
    res0: String = 2.2.0
    
    scala> spark.read.option("multiLine", true).json("employee.json").printSchema
    root
     |-- employeeDetails: struct (nullable = true)
     |    |-- name: string (nullable = true)
     |    |-- num: string (nullable = true)
     |-- work: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- monthYear: string (nullable = true)
     |    |    |-- workdate: string (nullable = true)
     |    |    |-- workhours: string (nullable = true)
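
    The same dot-notation access works with this reader as well, for example (a sketch, mirroring the query above):

    scala> spark.read.option("multiLine", true).json("employee.json").select("employeeDetails.name").show()
    +----+
    |name|
    +----+
    |xxxx|
    +----+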
    