Read multiline JSON in Apache Spark

面向向阳花 2020-11-30 00:28

I was trying to use a JSON file as a small database. After creating a temp table over the DataFrame, I queried it with SQL and got an exception. Here is my code:

val          


        
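A minimal sketch of code along these lines (the file path, table name, and exception text are assumptions, not details from the original post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("json-as-db")
  .master("local[*]")
  .getOrCreate()

// If user.json is a pretty-printed (multi-line) document, the default
// line-delimited JSON reader cannot parse it, and every line lands in the
// special _corrupt_record column instead of proper fields.
val usersDf = spark.read.json("/path/to/user.json")
usersDf.createOrReplaceTempView("user")

// Querying a real column then fails with something like:
// org.apache.spark.sql.AnalysisException: cannot resolve '`name`'
// given input columns: [_corrupt_record]
spark.sql("SELECT name FROM user").show()
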
2 Answers
  • zero323, 2020-11-30 01:11

    Spark >= 2.2

    Spark 2.2 introduced the multiLine option (initially committed as wholeFile), which can be used to load JSON (as opposed to JSONL) files:

    spark.read
      .option("multiLine", true).option("mode", "PERMISSIVE")
      .json("/path/to/user.json")
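
    For completeness, a minimal end-to-end sketch of the option in action; the file path and record contents are assumptions, not from the original question:

    import java.nio.file.{Files, Paths}

    // A single pretty-printed JSON document spanning several lines
    val doc =
      """{
        |  "name": "Alice",
        |  "age": 29
        |}""".stripMargin
    Files.write(Paths.get("/tmp/user.json"), doc.getBytes("UTF-8"))

    // Without multiLine every line would land in _corrupt_record;
    // with it, the document is parsed into proper fields:
    spark.read
      .option("multiLine", true)
      .json("/tmp/user.json")
      .show()
    // +---+-----+
    // |age| name|
    // +---+-----+
    // | 29|Alice|
    // +---+-----+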
    

    See:

    • SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
    • SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.

    Spark < 2.2

    Well, using JSONL-formatted data may be inconvenient, but I will argue that this is not an issue with the API but with the format itself. JSON is simply not designed to be processed in parallel in distributed systems.

    It provides no schema, and without making some very specific assumptions about its formatting and shape, it is almost impossible to correctly identify top-level documents. Arguably this is the worst possible format to use in systems like Apache Spark. It is also quite tricky, and typically impractical, to write valid JSON from distributed systems.
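
    For contrast, here is a sketch of the JSONL layout the default line-delimited reader expects: one complete document per line, so input splits can be parsed independently. The records are made up; on Spark < 2.2 such lines can be fed directly through the RDD[String] overload of json:

    // Each element is a complete JSON document on a single line
    val lines = sc.parallelize(Seq(
      """{"name": "Alice", "age": 29}""",
      """{"name": "Bob", "age": 31}"""
    ))

    spark.read.json(lines).printSchema()
    // root
    //  |-- age: long (nullable = true)
    //  |-- name: string (nullable = true)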

    That being said, if individual files are valid JSON documents (either a single document or an array of documents), you can always try wholeTextFiles:

    spark.read.json(sc.wholeTextFiles("/path/to/user.json").values)
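
    Note that wholeTextFiles reads each file as a single (path, content) pair, so parallelism is limited to the number of files and each whole file has to fit in a single task; .values keeps just the file contents.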
    
  • 2020-11-30 01:20

    Just to add on to zero323's answer: the option for reading multi-line JSON in Spark 2.2+ was renamed to multiLine (see the Spark documentation).

    Therefore, the correct syntax is now:

    spark.read
      .option("multiLine", true).option("mode", "PERMISSIVE")
      .json("/path/to/user.json")
    

    This happened in https://issues.apache.org/jira/browse/SPARK-20980.
