Merge Spark output CSV files with a single header

天命终不由人 2021-01-01 11:40

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.

I have a Scala script that takes raw data from S3, processes it with Spark, and writes the result back out as CSV. Spark writes one part file per partition, each with its own header, and I would like to end up with a single CSV file that has just one header row.

6 Answers
  • 2021-01-01 11:42
     // Convert the DataFrame to CSV and save it as text files
            outputDataframe.write()
                    .format("com.databricks.spark.csv")
                    // header=true writes a header line into each part file
                    .option("header", "true")
                    .save(outputPath);   // outputPath is a placeholder for the destination directory


    See the following link, which includes an integration test, for how to write a single header:

    http://bytepadding.com/big-data/spark/write-a-csv-text-file-from-spark/

  • 2021-01-01 11:45

    We had a similar issue and followed the approach below to get a single output file:

    1. Write the dataframe to HDFS with headers, without using coalesce or repartition (after the transformations):
    dataframe.write.format("csv").option("header", "true").save(hdfs_path_for_multiple_files)
    
    2. Read the files from the previous step and write them back to a different HDFS location with coalesce(1):
    dataframe = spark.read.option('header', 'true').csv(hdfs_path_for_multiple_files)
    
    dataframe.coalesce(1).write.format('csv').option('header', 'true').save(hdfs_path_for_single_file)
    

    This way, you avoid the performance problems of running coalesce or repartition during the transformations (step 1), and the second step produces a single output file with one header line.

  • 2021-01-01 11:55
    1. Build the header line from the dataframe schema: val header = dataDF.schema.fieldNames.reduce(_ + "," + _)
    2. Create a file containing only that header line on DSEFS.
    3. Append all the (headerless) partition files to the file from step 2 using the Hadoop FileSystem API (a minimal sketch of these steps follows).
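
    A minimal sketch of those steps, assuming the data was already written headerless to a directory and that dataDF is the dataframe being exported (both paths below are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils
    
    // step 1: build the header line from the dataframe schema
    val header = dataDF.schema.fieldNames.reduce(_ + "," + _)
    
    val fs = FileSystem.get(new Configuration())
    val mergedFile = new Path("/output/merged.csv")       // placeholder target file
    
    // step 2: create the target file containing only the header line
    val out = fs.create(mergedFile, true)
    out.write((header + "\n").getBytes("UTF-8"))
    
    // step 3: append every headerless part file to the target file
    fs.listStatus(new Path("/output/parts"))              // placeholder source directory
      .filter(_.getPath.getName.startsWith("part-"))
      .sortBy(_.getPath.getName)
      .foreach { status =>
        val in = fs.open(status.getPath)
        IOUtils.copyBytes(in, out, 4096, false)           // false: keep the output stream open
        in.close()
      }
    out.close()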
  • 2021-01-01 11:58

    To merge files in a folder into one file:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs._
    
    // merge all files under srcPath into the single file dstPath
    // (note: FileUtil.copyMerge was removed in Hadoop 3.x)
    def merge(srcPath: String, dstPath: String): Unit = {
      val hadoopConfig = new Configuration()
      val hdfs = FileSystem.get(hadoopConfig)
      FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
    }
    

    If you want all the data in one file while still writing to an output folder, use coalesce(1) (note that this forces all the data through a single executor):

    dataFrame
          .coalesce(1)
          .write
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .save(out)
    

    Another solution is to use the second approach above and then move the single part file inside the folder to another path, with the desired CSV file name:

    import java.io.File
    import org.apache.spark.sql.DataFrame
    
    def df2csv(df: DataFrame, fileName: String, sep: String = ",", header: Boolean = false): Unit = {
        val tmpDir = "tmpDir"
    
        // write a single part file into a temporary directory
        df.repartition(1)
          .write
          .format("com.databricks.spark.csv")
          .option("header", header.toString)
          .option("delimiter", sep)
          .save(tmpDir)
    
        // locate the part file (its exact name varies between Spark versions) and rename it
        val dir = new File(tmpDir)
        val tmpCsvFile = dir.listFiles.find(_.getName.startsWith("part-")).get
        tmpCsvFile.renameTo(new File(fileName))
    
        // remove the now-empty temporary directory
        dir.listFiles.foreach(f => f.delete)
        dir.delete
    }
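
    A hypothetical call, assuming resultDF is the dataframe to export and the job runs where tmpDir resolves to the local filesystem:

    // write resultDF to a single local CSV file with a header row
    df2csv(resultDF, "result.csv", sep = ",", header = true)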
    
  • 2021-01-01 11:59

    Try specifying the schema of the header and reading all the files from the folder with spark-csv's DROPMALFORMED mode. This should let you read every file in the folder while keeping only the header rows (because the data rows are dropped as malformed). Example:

    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    
    val headerSchema = List(
      StructField("example1", StringType, true),
      StructField("example2", StringType, true),
      StructField("example3", StringType, true)
    )
    
    val header_DF = sqlCtx.read
      .option("delimiter", ",")
      .option("header", "false")
      .option("mode", "DROPMALFORMED")
      .option("inferSchema", "false")
      .schema(StructType(headerSchema))
      .format("com.databricks.spark.csv")
      .load("folder containing the files")
    

    header_DF will then contain only the header rows; from there you can transform the dataframe the way you need.

  • 2021-01-01 12:00

    You can work around it like this:

    • 1. Create a new DataFrame (headerDF) containing the header names.
    • 2. Union it with the DataFrame (dataDF) containing the data.
    • 3. Write the unioned DataFrame to disk with option("header", "false").
    • 4. Merge the partition files (part-0000*.csv) using Hadoop FileUtil.

    This way, no partition has a header except the one that contains the single row of header names from headerDF. When all partitions are merged together, there is a single header at the top of the file. Sample code follows:

      import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
      import org.apache.spark.sql.{Row, SaveMode}
    
      //dataFrame is the data to save on disk
      //cast all columns to String so they can be unioned with the header row
      val dataDF = dataFrame.select(dataFrame.columns.map(c => dataFrame.col(c).cast("string")): _*)
    
      //create a new data frame containing only the header names
      import scala.collection.JavaConverters._
      val headerDF = sparkSession.createDataFrame(List(Row.fromSeq(dataDF.columns.toSeq)).asJava, dataDF.schema)
    
      //prepend the header row to the data and write without a header
      headerDF.union(dataDF).write.mode(SaveMode.Overwrite).option("header", "false").csv(outputFolder)
    
      //use Hadoop FileUtil to merge all partition CSV files into a single file
      //(note: FileUtil.copyMerge was removed in Hadoop 3.x)
      val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
      FileUtil.copyMerge(fs, new Path(outputFolder), fs, new Path("/folder/target.csv"), true,
        sparkSession.sparkContext.hadoopConfiguration, null)
    