Is there a way to prevent PySpark from creating several small files when writing a DataFrame to a JSON file?
If I run:
df.write.format('json').save('myfile.json')
df1.rdd.repartition(1).write.json('myfile.json')

would be nice, but isn't available: an RDD has no write attribute, so you have to stay in the DataFrame API. See this related answer: https://stackoverflow.com/a/33311467/2843520
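What does exist is repartitioning the DataFrame itself before writing. A minimal sketch, assuming df1 is the DataFrame from the question; note that 'myfile.json' will still be a directory containing a single part file, not a bare file:

# Sketch: repartition(1) collapses the data into one partition,
# so the write produces a single part-* file inside the
# 'myfile.json' output directory.
df1.repartition(1).write.json('myfile.json')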
Well, the answer to your exact question is the coalesce function. But, as already mentioned, it is not efficient at all, since it forces one worker to fetch all the data and write it sequentially:
df.coalesce(1).write.format('json').save('myfile.json')
P.S. By the way, the result is not a valid JSON file. It is a file with one JSON object per line (the JSON Lines format).
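If you need a true standalone file rather than a Spark output directory, one option is to move the lone part file out afterwards. A sketch, assuming the output lands on the local filesystem (on HDFS or S3 you would use the corresponding filesystem API instead); all path names here are illustrative:

import glob
import shutil

# coalesce(1) leaves 'myfile_tmp' as a directory holding exactly
# one part-* file (plus a _SUCCESS marker).
df.coalesce(1).write.format('json').save('myfile_tmp')

# Pull out the single part file, rename it, and drop the directory.
part_file = glob.glob('myfile_tmp/part-*')[0]
shutil.move(part_file, 'myfile.json')
shutil.rmtree('myfile_tmp')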
This was a better solution for me:

import json

rdd.map(json.dumps).saveAsTextFile(json_lines_file_name)
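If you are starting from a DataFrame rather than an RDD of Python objects, the same idea works without manual serialization. A sketch, with df and json_lines_file_name assumed as above:

# Sketch: toJSON() returns an RDD of JSON strings, one per row, so
# json.dumps is unnecessary; coalesce(1) keeps the output to one part file.
df.toJSON().coalesce(1).saveAsTextFile(json_lines_file_name)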