Write single CSV file using spark-csv

心在旅途 2020-11-22 08:43

I am using https://github.com/databricks/spark-csv. I am trying to write a single CSV file, but I can't: Spark creates a folder instead.

Need a Scala function which will take

13 Answers
  • 2020-11-22 09:01

    I'm using this in Python to get a single file:

    df.toPandas().to_csv("/tmp/my.csv", sep=',', header=True, index=False)
    
  • 2020-11-22 09:03

    If you are using Databricks and can fit all the data into RAM on one worker (and thus can use .coalesce(1)), you can use dbfs to find and move the resulting CSV file:

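    val fileprefix = "/mnt/aws/path/file-prefix"

    // Write everything as a single partition into a temporary directory.
    dataset
      .coalesce(1)
      .write
    //.mode("overwrite") // I usually don't use this, but you may want to.
      .option("header", "true")
      .option("delimiter", "\t")
      .csv(fileprefix + ".tmp")

    // Find the one CSV part file that Spark wrote into that directory.
    val partition_path = dbutils.fs.ls(fileprefix + ".tmp/")
      .filter(file => file.name.endsWith(".csv"))(0).path

    // Copy it to its final name, then delete the temporary directory.
    dbutils.fs.cp(partition_path, fileprefix + ".tab")

    dbutils.fs.rm(fileprefix + ".tmp", recurse = true)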
    

    If your file does not fit into RAM on the worker, you may want to consider chaotic3quilibrium's suggestion to use FileUtil.copyMerge(). I have not done this, and don't yet know whether it is possible, e.g., on S3.

    This answer is built on previous answers to this question as well as my own tests of the provided code snippet. I originally posted it to Databricks and am republishing it here.

    The best documentation for dbfs's rm's recursive option I have found is on a Databricks forum.

  • 2020-11-22 09:04

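    By using a ListBuffer we can save the data into a single file. Note that collect() pulls every line onto the driver, so this only suits data that fits in driver memory: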

    import java.io.FileWriter
    import scala.collection.mutable.ListBuffer

    // `spark` is the SparkSession provided by the Spark shell.
    val text = spark.read.textFile("filepath")

    // collect() brings every line to the driver; buffer the lines there.
    val data = ListBuffer[String]()
    for (line <- text.collect()) {
      data += line
    }

    // Write the buffered lines out through a single local writer.
    val writer = new FileWriter("filepath")
    data.foreach(line => writer.write(line + "\n"))
    writer.close()
    
  • 2020-11-22 09:06

    This answer expands on the accepted answer, gives more context, and provides code snippets you can run in the Spark Shell on your machine.

    More context on accepted answer

    The accepted answer might give you the impression the sample code outputs a single mydata.csv file and that's not the case. Let's demonstrate:

    val df = Seq("one", "two", "three").toDF("num")
    df
      .repartition(1)
      .write.csv(sys.env("HOME")+ "/Documents/tmp/mydata.csv")
    

    Here's the output:

    Documents/
      tmp/
        mydata.csv/
          _SUCCESS
          part-00000-b3700504-e58b-4552-880b-e7b52c60157e-c000.csv
    

    N.B. mydata.csv is a folder in the accepted answer - it's not a file!

    How to output a single file with a specific name

    We can use spark-daria to write out a single mydata.csv file.

    import com.github.mrpowers.spark.daria.sql.DariaWriters
    DariaWriters.writeSingleFile(
        df = df,
        format = "csv",
        sc = spark.sparkContext,
        tmpFolder = sys.env("HOME") + "/Documents/better/staging",
        filename = sys.env("HOME") + "/Documents/better/mydata.csv"
    )
    

    This'll output the file as follows:

    Documents/
      better/
        mydata.csv
    

    S3 paths

    You'll need to pass s3a paths to DariaWriters.writeSingleFile to use this method in S3:

    DariaWriters.writeSingleFile(
        df = df,
        format = "csv",
        sc = spark.sparkContext,
        tmpFolder = "s3a://bucket/data/src",
        filename = "s3a://bucket/data/dest/my_cool_file.csv"
    )
    

    See here for more info.

    Avoiding copyMerge

    copyMerge was removed in Hadoop 3. The DariaWriters.writeSingleFile implementation uses fs.rename instead, as described here. At the time of writing (2020), Spark 3 still used Hadoop 2, so copyMerge implementations still worked. I'm not sure when Spark will upgrade to Hadoop 3, but it's better to avoid any copyMerge approach, since it will break your code when Spark upgrades Hadoop.
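
    If the part files sit on a local or locally mounted filesystem, the merge that copyMerge performed can be approximated without Hadoop at all. The following is a minimal sketch under that assumption, not the spark-daria implementation; the function name merge_part_files and its has_header parameter are hypothetical. It streams the files in sorted order, so memory use does not grow with the data size, and it assumes that a header, if present, occupies exactly one line.

```python
import glob
import os
import shutil


def merge_part_files(src_dir, dst_path, has_header=False):
    """Concatenate Spark-style part files from src_dir into one file at dst_path.

    Hypothetical helper: assumes src_dir is on a local (or locally mounted)
    filesystem. Returns the number of part files that were merged.
    """
    # Spark names its outputs part-00000..., part-00001..., so a lexical sort
    # reproduces partition order. Marker files such as _SUCCESS do not match.
    part_files = sorted(glob.glob(os.path.join(src_dir, "part-*")))
    if not part_files:
        raise FileNotFoundError(f"no part files found in {src_dir}")

    with open(dst_path, "wb") as dst:
        for index, part in enumerate(part_files):
            with open(part, "rb") as src:
                if has_header and index > 0:
                    # Every part file repeats the header; keep only the first.
                    src.readline()
                # Copy in chunks instead of loading the whole file into memory.
                shutil.copyfileobj(src, dst)
    return len(part_files)
```

    For example, merge_part_files("/tmp/mydata.csv", "/tmp/mydata-merged.csv", has_header=True) would combine a folder that was written with the header option enabled. Write the destination outside the source folder if you plan to delete that folder afterwards.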

    Source code

    Look for the DariaWriters object in the spark-daria source code if you'd like to inspect the implementation.

    PySpark implementation

    It's easier to write out a single file with PySpark because you can convert the DataFrame to a Pandas DataFrame that gets written out as a single file by default.

    from pathlib import Path
    home = str(Path.home())
    data = [
        ("jellyfish", "JALYF"),
        ("li", "L"),
        ("luisa", "LAS"),
        (None, None)
    ]
    df = spark.createDataFrame(data, ["word", "expected"])
    df.toPandas().to_csv(home + "/Documents/tmp/mydata-from-pyspark.csv", sep=',', header=True, index=False)
    

    Limitations

    The DariaWriters.writeSingleFile Scala approach and the df.toPandas() Python approach only work for small datasets. Both funnel the whole dataset through a single machine, so huge datasets cannot be written out this way. Writing out data as a single file also isn't optimal from a performance perspective because the data can't be written in parallel.

  • 2020-11-22 09:08

    A solution that works for S3 modified from Minkymorgan.

    Simply pass the temporary partitioned directory path (with a different name from the final path) as srcPath and the single final csv/txt file as dstPath. Also set deleteSource if you want to remove the original directory.

    /**
    * Merges multiple partitions of Spark text file output into a single file.
    * Assumes a SparkSession named `spark` is in scope (e.g. in spark-shell)
    * and Hadoop 2.x, since FileUtil.copyMerge was removed in Hadoop 3.
    * @param srcPath source directory of partitioned files
    * @param dstPath output path of the single merged file
    * @param deleteSource whether or not to delete source directory after merging
    */
    def mergeTextFiles(srcPath: String, dstPath: String, deleteSource: Boolean): Unit = {
      import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
      import java.net.URI
      val config = spark.sparkContext.hadoopConfiguration
      val fs: FileSystem = FileSystem.get(new URI(srcPath), config)
      FileUtil.copyMerge(
        fs, new Path(srcPath), fs, new Path(dstPath), deleteSource, config, null
      )
    }
    
  • 2020-11-22 09:09

    Spark's df.write API creates multiple part files inside the given path. To force Spark to write only a single part file, use df.coalesce(1).write.csv(...) instead of df.repartition(1).write.csv(...): coalesce is a narrow transformation, whereas repartition is a wide transformation that triggers a full shuffle. See Spark - repartition() vs coalesce().

    df.coalesce(1).write.csv(filepath,header=True) 
    

    will create a folder at the given filepath containing a single part-0001-...-c000.csv file. Use

    cat filepath/part-0001-...-c000.csv > filename_you_want.csv 
    

    to get a user-friendly filename.
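
    The same rename step can be scripted instead of typed by hand. Below is a minimal sketch using only the Python standard library, assuming the output folder is on a local filesystem; promote_single_part_file is a hypothetical helper name, not part of Spark.

```python
import glob
import os
import shutil


def promote_single_part_file(output_dir, final_path, remove_dir=False):
    """Move the only part-*.csv file in output_dir to final_path.

    Hypothetical helper: assumes output_dir was written with coalesce(1),
    so it holds exactly one part file, and lives on a local filesystem.
    """
    matches = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(matches) != 1:
        raise ValueError(f"expected exactly one part file, found {len(matches)}")

    shutil.move(matches[0], final_path)

    if remove_dir:
        # Removes what is left of the folder, including the _SUCCESS marker.
        # final_path must therefore be outside output_dir.
        shutil.rmtree(output_dir)
    return final_path
```

    Failing loudly when the folder holds zero or several part files is a deliberate choice here: silently picking one of several files would drop data.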
