Write single CSV file using spark-csv

心在旅途 2020-11-22 08:43

I am using https://github.com/databricks/spark-csv and trying to write a single CSV file, but I am not able to; it creates a folder instead.
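
For illustration (a minimal sketch, assuming a DataFrame named df; not part of the original question), this is the kind of write that produces a folder rather than a single file:

    // "mydata.csv" ends up as a directory containing part-00000 etc.,
    // not a single CSV file
    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("mydata.csv")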

I need a Scala function which will take a path and file name as parameters and write a single CSV file there.

13 Answers
  •  既然无缘
    2020-11-22 09:03

    If you are using Databricks and can fit all the data into RAM on one worker (and thus can use .coalesce(1)), you can use dbfs to find and move the resulting CSV file:

    val fileprefix = "/mnt/aws/path/file-prefix"

    dataset
      .coalesce(1)                      // pull everything into one partition
      .write
    //.mode("overwrite")                // I usually don't use this, but you may want to.
      .option("header", "true")
      .option("delimiter", "\t")
      .csv(fileprefix + ".tmp")         // Spark still writes a folder of part files

    // locate the single part-*.csv file inside the temporary folder
    val partition_path = dbutils.fs.ls(fileprefix + ".tmp/")
      .filter(file => file.name.endsWith(".csv"))(0).path

    // copy it out under its final name, then remove the temporary folder
    dbutils.fs.cp(partition_path, fileprefix + ".tab")
    dbutils.fs.rm(fileprefix + ".tmp", recurse = true)
    

    If your file does not fit into RAM on the worker, you may want to consider chaotic3quilibrium's suggestion to use FileUtil.copyMerge(). I have not done this myself, and don't yet know whether it is possible, e.g., on S3.
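
    For reference, here is a minimal, untested sketch of what that copyMerge approach could look like (my own illustration, not from the original answer; mergeToSingleFile is a hypothetical helper, and note that FileUtil.copyMerge was removed in Hadoop 3):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileUtil, Path}

    def mergeToSingleFile(srcDir: String, dstFile: String, conf: Configuration): Boolean = {
      val src = new Path(srcDir)
      val dst = new Path(dstFile)
      FileUtil.copyMerge(
        src.getFileSystem(conf), src,   // folder of part files written by Spark
        dst.getFileSystem(conf), dst,   // single destination file
        true,                           // delete the source folder when done
        conf,
        null)                           // nothing inserted between concatenated parts
    }

    // e.g. mergeToSingleFile(fileprefix + ".tmp", fileprefix + ".csv",
    //                        spark.sparkContext.hadoopConfiguration)
    // Caveat: if the parts were written with header = true, each part file
    // carries its own header row, so write without headers before merging.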

    This answer is built on previous answers to this question as well as my own tests of the provided code snippet. I originally posted it to Databricks and am republishing it here.

    The best documentation I have found for the recurse option of dbutils.fs.rm is on a Databricks forum.
