Spark - How to write a single csv file WITHOUT folder?

北恋 2020-12-28 13:44

Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is

df.coalesce(1).write.option("header", "true")…

9 Answers
  • 2020-12-28 14:06

    There is no dataframe Spark API which writes/creates a single file instead of a directory as the result of a write operation.

    Both options below will create one single file inside a directory, along with the standard marker files (_SUCCESS, _committed, _started).

     1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")

     2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
    

    If you don't use coalesce(1) or repartition(1) and instead take advantage of Spark's parallelism for writing files, it will create multiple data files inside the directory.

    You need to write a function in the driver which combines all the data file parts into a single file (e.g. cat part-00000* singlefilename) once the write operation is done; a sketch of such a merge is shown below.
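
    A minimal sketch of such a driver-side merge, assuming the part files are readable from the driver's local filesystem (the function name and paths are hypothetical):

    import glob

    def merge_csv_parts(folder, target, has_header=True):
        """Concatenate Spark's part-* files into one CSV, keeping only the first header line."""
        part_files = sorted(glob.glob(folder + "/part-*"))
        with open(target, "w") as out:
            for i, part in enumerate(part_files):
                with open(part) as f:
                    if has_header and i > 0:
                        next(f, None)  # skip the repeated header line
                    out.writelines(f)

    merge_csv_parts("PATH/FOLDER_NAME/x.csv", "PATH/single_file.csv")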

  • 2020-12-28 14:11

    Create a temp folder inside the output folder, copy the part-00000* file to the output folder under the desired file name, then delete the temp folder. Here is a Python snippet that does this on Databricks:

    fpath = output + '/' + 'temp'
    
    def file_exists(path):
      try:
        dbutils.fs.ls(path)
        return True
      except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
          return False
        else:
          raise
    
    # Remove any leftover temp folder from a previous run, then write the single part file into it
    if file_exists(fpath):
      dbutils.fs.rm(fpath, True)
    df.coalesce(1).write.option("header", "true").csv(fpath)
    
    # Copy the part file to the output folder under the desired name, then drop the temp folder
    fname = [x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')]
    dbutils.fs.cp(fpath + "/" + fname[0], output + "/" + "name.csv")
    dbutils.fs.rm(fpath, True)
    
    
  • 2020-12-28 14:13

    I had the same problem and solved it with Python's tempfile module plus boto3.

    import glob, os
    from tempfile import TemporaryDirectory
    import boto3
    
    s3 = boto3.resource('s3')
    
    # Spark writes a folder of part files, so write into a temp directory
    # and upload only the single part file to S3.
    with TemporaryDirectory() as tmp:
        target = os.path.join(tmp, 'csv')
        df.coalesce(1).write.format('csv').options(header=True).save(target)
        part_file = glob.glob(os.path.join(target, 'part-*'))[0]
        s3.meta.client.upload_file(part_file, S3_BUCKET, S3_FOLDER + 'name.csv')
    

    See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file().

  • 2020-12-28 14:17

    A more Databricks-native solution is this:

    import os
    
    TEMPORARY_TARGET = "dbfs:/my_folder/filename"
    DESIRED_TARGET = "dbfs:/my_folder/filename.csv"
    
    spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
    
    # Pick the single part file out of the folder Spark wrote (more robust than a fixed index)
    part_name = next(f.name for f in dbutils.fs.ls(TEMPORARY_TARGET) if f.name.startswith("part-"))
    temporary_csv = os.path.join(TEMPORARY_TARGET, part_name)
    
    dbutils.fs.cp(temporary_csv, DESIRED_TARGET)
    

    Note that if you are working with a Koalas dataframe, you can replace spark_df with koalas_df.to_spark(); see the sketch below.
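
    A minimal sketch, assuming the databricks.koalas package that ships with Databricks runtimes (the sample data is made up):

    import databricks.koalas as ks

    koalas_df = ks.DataFrame({"id": [1, 2, 3]})  # hypothetical Koalas frame
    spark_df = koalas_df.to_spark()              # plugs into the snippet above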

  • 2020-12-28 14:21

    A possible solution is to convert the Spark dataframe to a pandas dataframe and save it as CSV:

    df.toPandas().to_csv("<path>/<filename>")
    
  • 2020-12-28 14:22

    If the result size is comparable to the Spark driver node's free memory, you may run into problems converting the dataframe to pandas.

    I would tell Spark to save to some temporary location, and then copy the single part file into the desired location. Something like this:

    import os
    import shutil
    
    TEMPORARY_TARGET="big/storage/name"
    DESIRED_TARGET="/export/report.csv"
    
    df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
    
    part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
    temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
    
    shutil.copyfile(temporary_csv, DESIRED_TARGET)
    

    If you work with Databricks, Spark operates on paths like dbfs:/mnt/..., and to use Python's file operations on them you need to change the path to /dbfs/mnt/..., or (more native to Databricks) replace shutil.copyfile with dbutils.fs.cp, as sketched below.
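
    A minimal sketch of the dbutils.fs.cp variant, assuming the job runs on Databricks (the dbfs:/mnt/... paths are hypothetical):

    TEMPORARY_TARGET = "dbfs:/mnt/big/storage/name"
    DESIRED_TARGET = "dbfs:/mnt/export/report.csv"

    df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)

    # dbutils.fs.ls understands dbfs:/ paths directly, so no /dbfs/... rewriting is needed
    part_filename = next(f.name for f in dbutils.fs.ls(TEMPORARY_TARGET) if f.name.startswith("part-"))
    dbutils.fs.cp(TEMPORARY_TARGET + "/" + part_filename, DESIRED_TARGET)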
