Spark - How to write a single csv file WITHOUT folder?

北恋 2020-12-28 13:44

Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is

df.coalesce(1).write.option("header", "true")…

9 Answers
  • 2020-12-28 14:06

    There is no dataframe Spark API which writes/creates a single file instead of a directory as the result of a write operation.

    Both options below will create one single file inside a directory, along with the standard marker files (_SUCCESS, _committed, _started).

     1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")

     2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
    

    If you don't use coalesce(1) or repartition(1) and instead take advantage of Spark's parallelism for writing files, it will create multiple data files inside the directory.

    You need to write a function in the driver which combines all the data file parts into a single file (e.g. cat part-00000* singlefilename) once the write operation is done; a sketch of such a merge is shown below.
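
    A minimal sketch of such a driver-side merge, assuming the part files are readable from the driver's local filesystem (the function name and paths are hypothetical):

    import glob

    def merge_csv_parts(folder, target, has_header=True):
        """Concatenate Spark's part-* files into one CSV, keeping only the first header line."""
        part_files = sorted(glob.glob(folder + "/part-*"))
        with open(target, "w") as out:
            for i, part in enumerate(part_files):
                with open(part) as f:
                    if has_header and i > 0:
                        next(f, None)  # skip the repeated header line
                    out.writelines(f)

    merge_csv_parts("PATH/FOLDER_NAME/x.csv", "PATH/single_file.csv")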

  • 2020-12-28 14:11

    Create a temp folder inside the output folder, copy the part-00000* file to the output folder under the desired file name, then delete the temp folder. Here is a Python snippet that does this on Databricks:

    fpath = output + '/' + 'temp'
    
    def file_exists(path):
      try:
        dbutils.fs.ls(path)
        return True
      except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
          return False
        else:
          raise
    
    # Remove any leftover temp folder from a previous run, then write the single part file into it
    if file_exists(fpath):
      dbutils.fs.rm(fpath, True)
    df.coalesce(1).write.option("header", "true").csv(fpath)
    
    # Copy the part file to the output folder under the desired name, then drop the temp folder
    fname = [x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')]
    dbutils.fs.cp(fpath + "/" + fname[0], output + "/" + "name.csv")
    dbutils.fs.rm(fpath, True)
    
    
  • 2020-12-28 14:13

    I had the same problem and solved it with Python's tempfile module plus boto3.

    import glob, os
    from tempfile import TemporaryDirectory
    import boto3
    
    s3 = boto3.resource('s3')
    
    # Spark writes a folder of part files, so write into a temp directory
    # and upload only the single part file to S3.
    with TemporaryDirectory() as tmp:
        target = os.path.join(tmp, 'csv')
        df.coalesce(1).write.format('csv').options(header=True).save(target)
        part_file = glob.glob(os.path.join(target, 'part-*'))[0]
        s3.meta.client.upload_file(part_file, S3_BUCKET, S3_FOLDER + 'name.csv')
    

    See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file().

  • 2020-12-28 14:17

    A more Databricks-native solution is this:

    import os
    
    TEMPORARY_TARGET = "dbfs:/my_folder/filename"
    DESIRED_TARGET = "dbfs:/my_folder/filename.csv"
    
    spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
    
    # Pick the single part file out of the folder Spark wrote (more robust than a fixed index)
    part_name = next(f.name for f in dbutils.fs.ls(TEMPORARY_TARGET) if f.name.startswith("part-"))
    temporary_csv = os.path.join(TEMPORARY_TARGET, part_name)
    
    dbutils.fs.cp(temporary_csv, DESIRED_TARGET)
    

    Note that if you are working with a Koalas dataframe, you can replace spark_df with koalas_df.to_spark(); see the sketch below.
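
    A minimal sketch, assuming the databricks.koalas package that ships with Databricks runtimes (the sample data is made up):

    import databricks.koalas as ks

    koalas_df = ks.DataFrame({"id": [1, 2, 3]})  # hypothetical Koalas frame
    spark_df = koalas_df.to_spark()              # plugs into the snippet above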

  • 2020-12-28 14:21

    A possible solution is to convert the Spark dataframe to a pandas dataframe and save it as CSV:

    df.toPandas().to_csv("<path>/<filename>")
    
  • 2020-12-28 14:22

    If the result size is comparable to the Spark driver node's free memory, you may run into problems converting the dataframe to pandas.

    I would tell Spark to save to some temporary location, and then copy the single part file into the desired location. Something like this:

    import os
    import shutil
    
    TEMPORARY_TARGET="big/storage/name"
    DESIRED_TARGET="/export/report.csv"
    
    df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
    
    part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
    temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
    
    shutil.copyfile(temporary_csv, DESIRED_TARGET)
    

    If you work with Databricks, Spark operates on paths like dbfs:/mnt/..., and to use Python's file operations on them you need to change the path to /dbfs/mnt/..., or (more native to Databricks) replace shutil.copyfile with dbutils.fs.cp, as sketched below.
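
    A minimal sketch of the dbutils.fs.cp variant, assuming the job runs on Databricks (the dbfs:/mnt/... paths are hypothetical):

    TEMPORARY_TARGET = "dbfs:/mnt/big/storage/name"
    DESIRED_TARGET = "dbfs:/mnt/export/report.csv"

    df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)

    # dbutils.fs.ls understands dbfs:/ paths directly, so no /dbfs/... rewriting is needed
    part_filename = next(f.name for f in dbutils.fs.ls(TEMPORARY_TARGET) if f.name.startswith("part-"))
    dbutils.fs.cp(TEMPORARY_TARGET + "/" + part_filename, DESIRED_TARGET)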
