How to access sub-entities in a JSON file?

Asked by 闹比i on 2020-12-04 04:01

I have a JSON file that looks like this:

{
  "employeeDetails":{
    "name": "xxxx",
    "num":"415"
  },
  "work":[
    {
      "monthYear":"01/2007",
      "workdate":"1|2|3|....|31",
      "workhours":"8|8|8....|8"
    },
    {
      "monthYear":"02/2007",
      "workdate":"1|2|3|....|31",
      "workhours":"8|8|8....|8"
    }
  ]
}

How do I access the sub-entities in this file using Spark?

1 Answer
    Reading the file as-is and selecting any field fails with an error that ends in given input columns: [_corrupt_record];; because the only column Spark could produce is _corrupt_record.

    The reason is that Spark supports JSON files in which "Each line must contain a separate, self-contained valid JSON object."

    Quoting JSON Datasets:

    Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.
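
    To see the difference, a JSON Lines version of the same data would keep the whole record on a single line, e.g.:

    {"employeeDetails":{"name":"xxxx","num":"415"},"work":[{"monthYear":"01/2007","workdate":"1|2|3|....|31","workhours":"8|8|8....|8"},{"monthYear":"02/2007","workdate":"1|2|3|....|31","workhours":"8|8|8....|8"}]}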

    If a JSON file is incorrect for Spark, Spark stores the offending records under the _corrupt_record column (a name you can change with the columnNameOfCorruptRecord option).

    scala> spark.read.json("employee.json").printSchema
    root
     |-- _corrupt_record: string (nullable = true)
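
    For example, the corrupt-record column can be renamed on read (a sketch; the name _bad_record is an arbitrary choice, not anything Spark requires):

    scala> spark.read.option("columnNameOfCorruptRecord", "_bad_record").json("employee.json").printSchema
    root
     |-- _bad_record: string (nullable = true)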
    

    And your file is incorrect not only because it's multi-line JSON; jq (a lightweight and flexible command-line JSON processor) also rejects it:

    $ cat incorrect.json
    {
      "employeeDetails":{
        "name": "xxxx",
        "num:"415"
      }
      "work":[
      {
        "monthYear":"01/2007"
        "workdate":"1|2|3|....|31",
        "workhours":"8|8|8....|8"
      },
      {
        "monthYear":"02/2007"
        "workdate":"1|2|3|....|31",
        "workhours":"8|8|8....|8"
      }
      ],
    }
    $ cat incorrect.json | jq
    parse error: Expected separator between values at line 4, column 14
    

    Once you fix the JSON file (restore the missing quote after num, add the missing commas after the employeeDetails object and after the monthYear values, and drop the trailing comma before the closing brace), use the following trick to load the multi-line JSON file. sc.wholeTextFiles reads each file as a single (path, content) pair, so the entire document is handed to the JSON parser as one value.

    scala> spark.version
    res5: String = 2.1.1
    
    scala> val employees = spark.read.json(sc.wholeTextFiles("employee.json").values)
    scala> employees.printSchema
    root
     |-- employeeDetails: struct (nullable = true)
     |    |-- name: string (nullable = true)
     |    |-- num: string (nullable = true)
     |-- work: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- monthYear: string (nullable = true)
     |    |    |-- workdate: string (nullable = true)
     |    |    |-- workhours: string (nullable = true)
    
    scala> employees.select("employeeDetails").show()
    +---------------+
    |employeeDetails|
    +---------------+
    |     [xxxx,415]|
    +---------------+
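
    With the file loaded, the sub-entities from the question are reachable with dot notation into the employeeDetails struct, and explode flattens the work array. A sketch (it relies on the $-column implicits that spark-shell pre-imports):

    import org.apache.spark.sql.functions.explode

    // One output row per element of the work array.
    employees
      .select($"employeeDetails.name", $"employeeDetails.num", explode($"work").as("w"))
      .select($"name", $"num", $"w.monthYear", $"w.workdate", $"w.workhours")
      .show()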
    

    Spark >= 2.2

    As of Spark 2.2 you should use the multiLine option instead. The option was added in SPARK-20980 (Rename the option wholeFile to multiLine for JSON and CSV).

    scala> spark.version
    res0: String = 2.2.0
    
    scala> spark.read.option("multiLine", true).json("employee.json").printSchema
    root
     |-- employeeDetails: struct (nullable = true)
     |    |-- name: string (nullable = true)
     |    |-- num: string (nullable = true)
     |-- work: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- monthYear: string (nullable = true)
     |    |    |-- workdate: string (nullable = true)
     |    |    |-- workhours: string (nullable = true)
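
    The same dot-notation access works with this reader as well, for example (a sketch, mirroring the query above):

    scala> spark.read.option("multiLine", true).json("employee.json").select("employeeDetails.name").show()
    +----+
    |name|
    +----+
    |xxxx|
    +----+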
    