Write single CSV file using spark-csv

心在旅途 2020-11-22 08:43

I am using https://github.com/databricks/spark-csv and trying to write a single CSV file, but I can't: it creates a folder instead.

I need a Scala function which will take a path and file name as parameters and write out a single CSV file.

13 Answers
  • 2020-11-22 09:10

There is one more way: use plain Java I/O (the snippet below is Scala calling the java.io classes):

    import java.io._

    def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit): Unit = {
      val p = new java.io.PrintWriter(f)
      try { op(p) }
      finally { p.close() }
    }

    // collect() pulls the whole DataFrame into driver memory,
    // so this only suits data that fits on the driver
    printToFile(new File("C:/TEMP/df.csv")) { p => df.collect().foreach(p.println) }
    
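    Note that p.println on a Row prints its toString form, like [a,b,c], rather than real CSV. A minimal variant under that caveat (it joins fields with commas but does no quoting or escaping, which a proper CSV writer would handle):

    // Hypothetical adjustment: join each Row's fields with commas.
    // Fields containing commas or newlines would still need quoting.
    printToFile(new File("C:/TEMP/df.csv")) { p =>
      df.collect().foreach(row => p.println(row.mkString(",")))
    }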
  • 2020-11-22 09:15

If you are running Spark with HDFS, I've been solving the problem by writing the csv files normally and leveraging HDFS to do the merging. I'm doing that directly in Spark (1.6):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs._

    def merge(srcPath: String, dstPath: String): Unit = {
      val hadoopConfig = new Configuration()
      val hdfs = FileSystem.get(hadoopConfig)
      // the "true" flag deletes the source files once they are merged into the new output
      FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
    }

    val newData = << create your dataframe >>

    val outputfile = "/user/feeds/project/outputs/subject"
    val filename = "myinsights"
    val outputFileName = outputfile + "/temp_" + filename
    val mergedFileName = outputfile + "/merged_" + filename
    val mergeFindGlob = outputFileName

    newData.write
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .mode("overwrite")
      .save(outputFileName)
    merge(mergeFindGlob, mergedFileName)
    newData.unpersist()
    

    Can't remember where I learned this trick, but it might work for you.

  • 2020-11-22 09:17

    You can use rdd.coalesce(1, true).saveAsTextFile(path)

    It will store the data as a single file at path/part-00000

  • 2020-11-22 09:20

    repartition/coalesce to 1 partition before you save (you'd still get a folder, but it would have one part file in it); see the sketch below.
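    As an illustration, a minimal sketch with the spark-csv writer (df and the output path are placeholders, not from the original answer):

    // coalesce(1) collapses the data into a single partition, so the
    // output folder contains exactly one part-* file with all the rows
    df.coalesce(1)
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/tmp/mydata.csv")  // still a folder; the single CSV is the part file inside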

  • 2020-11-22 09:23

    I might be a little late to the game here, but using coalesce(1) or repartition(1) may work for small data sets, but a large data set would all be thrown into one partition on one node. This is likely to throw OOM errors or, at best, to process slowly.

    I would highly suggest that you use the FileUtil.copyMerge() function from the Hadoop API. This will merge the outputs into a single file.

    EDIT: This effectively brings the data to the driver rather than an executor node. coalesce() would be fine if a single executor has more RAM for use than the driver.

    EDIT 2: copyMerge() is being removed in Hadoop 3.0. See the following Stack Overflow question for more information on how to work with the newest version: How to do CopyMerge in Hadoop 3.0?
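    For Hadoop 3, a minimal sketch of what copyMerge() used to do, written against the plain FileSystem API (copyMergeCompat is a hypothetical helper name, not a Hadoop API):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils

    // Concatenate every part file under srcDir into a single dstFile.
    // Assumes plain CSV parts with no headers; files are appended in name order.
    def copyMergeCompat(fs: FileSystem, srcDir: Path, dstFile: Path, conf: Configuration): Unit = {
      val out = fs.create(dstFile)
      try {
        fs.listStatus(srcDir)
          .filter(s => s.isFile && !s.getPath.getName.startsWith("_"))  // skip _SUCCESS
          .sortBy(_.getPath.getName)
          .foreach { status =>
            val in = fs.open(status.getPath)
            try IOUtils.copyBytes(in, out, conf, false)  // false: keep out open
            finally in.close()
          }
      } finally out.close()
    }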

  • 2020-11-22 09:25
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs._
    import org.apache.spark.sql.{DataFrame,SaveMode,SparkSession}
    import org.apache.spark.sql.functions._
    

    I solved it using the approach below (rename the file in HDFS):

    Step 1:- (Create DataFrame and write to HDFS)

    df.coalesce(1).write.format("csv").option("header", "false").mode(SaveMode.Overwrite).save("/hdfsfolder/blah/")
    

    Step 2:- (Create Hadoop Config)

    val hadoopConfig = new Configuration()
    val hdfs = FileSystem.get(hadoopConfig)
    

    Step 3:- (Get the HDFS folder path)

    val pathFiles = new Path("/hdfsfolder/blah/")
    

    Step 4:- (List the Spark part-file names in the HDFS folder)

    val fileNames = hdfs.listFiles(pathFiles, false)
    // fileNames is a RemoteIterator[LocatedFileStatus], not a printable collection
    

    Step 5:- (Create a Scala mutable list and add all the file names to it)

    val fileNamesList = scala.collection.mutable.MutableList[String]()
    while (fileNames.hasNext) {
      fileNamesList += fileNames.next().getPath.getName
    }
    println(fileNamesList)
    

    Step 6:- (Filter the _SUCCESS file out of the file names list)

    // get the file name which is not _SUCCESS
    val partFileName = fileNamesList.filterNot(fileName => fileName == "_SUCCESS")
    

    Step 7:- (Convert the Scala list to a string, build the source and target paths in the HDFS folder, and apply the rename)

    val partFileSourcePath = new Path("/hdfsfolder/blah/" + partFileName.mkString(""))
    val desiredCsvTargetPath = new Path("/hdfsfolder/blah/" + "op_" + ".csv")
    hdfs.rename(partFileSourcePath, desiredCsvTargetPath)
    
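    Putting steps 2 through 7 together, a minimal reusable sketch under the same assumptions (renamePartFile is a hypothetical helper using the imports above; folder and targetName are placeholders):

    // Locate the single part file that coalesce(1) produced in `folder`
    // and rename it to `targetName` inside the same folder.
    def renamePartFile(folder: String, targetName: String): Boolean = {
      val hdfs = FileSystem.get(new Configuration())
      val partFile = hdfs.listStatus(new Path(folder))
        .map(_.getPath.getName)
        .find(_.startsWith("part-"))  // skips _SUCCESS and other markers
        .getOrElse(sys.error(s"no part file found in $folder"))
      hdfs.rename(new Path(folder + "/" + partFile), new Path(folder + "/" + targetName))
    }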