Merge Spark output CSV files with a single header

天命终不由人 2021-01-01 11:40

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.

I have a Scala script that takes raw data from S3, proc

  • 2021-01-01 11:42
     // Convert JavaRDD to CSV and save as text file
     // header = true writes a header line into each output file
     .option("header", "true")

    Please follow the link with Integration test on how to write a single header

  • 2021-01-01 11:45

    We had a similar issue and used the following approach to get a single output file:

    1. Write the dataframe to HDFS with headers, without using coalesce or repartition (after the transformations).
    dataframe.write.format("csv").option("header", "true").save(hdfs_path_for_multiple_files)
    2. Read the files from the previous step and write them back to a different location on HDFS with coalesce(1).
    dataframe = spark.read.option('header', 'true').csv(hdfs_path_for_multiple_files)
    dataframe.coalesce(1).write.format('csv').option('header', 'true').save(hdfs_path_for_single_file)

    This way you avoid the performance cost of coalesce or repartition while the transformations run (step 1), and the second step produces a single output file with one header line.

  • 2021-01-01 11:55
    1. Build the header from dataframe.schema (val header = dataDF.schema.fieldNames.reduce(_ + "," + _)).
    2. Create a file containing the header on DSEFS.
    3. Append all the (headerless) partition files to the file from step 2 using the Hadoop FileSystem API.
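
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `mergeWithHeader` is a hypothetical helper name, the part files are assumed to have been written with option("header", "false"), and the sketch streams everything into one newly created file instead of calling append, because not every Hadoop file system implementation supports append.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical helper: writes `header` as the first line of `dstFile`, then
// copies every part-* file found in `srcDir` after it (part files are assumed headerless).
def mergeWithHeader(fs: FileSystem, header: String, srcDir: String, dstFile: String): Unit = {
  val out = fs.create(new Path(dstFile), true) // true = overwrite an existing file
  try {
    out.write((header + "\n").getBytes("UTF-8"))
    fs.listStatus(new Path(srcDir))
      .filter(s => s.isFile && s.getPath.getName.startsWith("part-")) // skips _SUCCESS and similar
      .sortBy(_.getPath.getName)                                      // keep partition order
      .foreach { status =>
        val in = fs.open(status.getPath)
        try IOUtils.copyBytes(in, out, fs.getConf, false) // false = do not close the streams here
        finally in.close()
      }
  } finally out.close()
}

// Usage sketch (dataDF, `spark` and both paths are placeholders):
// val header = dataDF.schema.fieldNames.mkString(",")
// dataDF.write.option("header", "false").csv("/tmp/parts")
// val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// mergeWithHeader(fs, header, "/tmp/parts", "/tmp/merged.csv")
```

The destination file should live outside the source folder, otherwise a later run would pick it up as input if its name happened to start with part-.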
  • 2021-01-01 11:58

    To merge files in a folder into one file:

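    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs._
    // Note: FileUtil.copyMerge exists in Hadoop 2.x but was removed in Hadoop 3.0.
    def merge(srcPath: String, dstPath: String): Unit =  {
      val hadoopConfig = new Configuration()
      val hdfs = FileSystem.get(hadoopConfig)
      // false = keep the source files; null = no separator string between files
      FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
    }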

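    If you want to merge all files into one file, but still in the same folder (note that this brings all the data into a single partition, processed by one task):

    // Receiver and final call were lost from the post; coalesce(1) and csv(outputFolder) are assumed.
    dataFrame.coalesce(1).write.option("header", "true").csv(outputFolder)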

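    Another solution would be to use solution #2, then move the single file inside the folder to another path (with the name of our CSV file).

    import java.io.File
    import org.apache.spark.sql.DataFrame
    def df2csv(df: DataFrame, fileName: String, sep: String = ",", header: Boolean = false): Unit = {
        val tmpDir = "tmpDir"
        // The write call around the two options was lost from the post; repartition(1) ... save(tmpDir) is assumed.
        df.repartition(1).write.format("csv")
          .option("header", header.toString)
          .option("delimiter", sep)
          .save(tmpDir)
        val dir = new File(tmpDir)
        // Local file system only. Newer Spark versions add a suffix after part-00000, so match on the prefix.
        dir.listFiles.find(_.getName.startsWith("part-00000")).foreach(_.renameTo(new File(fileName)))
        dir.listFiles.foreach( f => f.delete )
        dir.delete()
    }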
  • 2021-01-01 11:59

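    Try specifying the schema of the header and reading all files from the folder using the DROPMALFORMED mode of the CSV reader (spark-csv). This should let you read all the files in the folder while keeping only the headers (because the malformed rows are dropped). Example:

    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    val headerSchema = List(
      StructField("example1", StringType, true),
      StructField("example2", StringType, true),
      StructField("example3", StringType, true)
    )
    // Part of the reader chain was lost from the post; the schema and mode calls are
    // reconstructed from the description above, and `spark` is an assumed SparkSession.
    val header_DF = spark.read
      .format("csv")
      .option("delimiter", ",")
      .option("header", "false")
      .option("mode", "DROPMALFORMED")
      .schema(StructType(headerSchema))
      .load("folder containing the files")

    In header_DF you will have only the header rows; from there you can transform the dataframe the way you need.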

  • 2021-01-01 12:00

    You can work around it like this:

    • 1. Create a new DataFrame (headerDF) containing the header names.
    • 2. Union it with the DataFrame (dataDF) containing the data.
    • 3. Write the unioned DataFrame to disk with option("header", "false").
    • 4. Merge the partition files (part-0000*.csv) using Hadoop FileUtil.

    This way no partition has a header, except for the single partition whose content starts with the row of header names from headerDF. When all partitions are merged together, there is a single header at the top of the file. Sample code follows:

      import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
      import org.apache.spark.sql.{Row, SaveMode}
      import scala.collection.JavaConverters._
      //dataFrame is the data to save on disk
      //cast types of all columns to String
      val dataDF = dataFrame.select(dataFrame.columns.map(c => dataFrame.col(c).cast("string")): _*)
      //create a new data frame containing only header names
      val headerDF = sparkSession.createDataFrame(List(Row.fromSeq(dataDF.columns.toSeq)).asJava, dataDF.schema)
      //merge header names with data
      headerDF.union(dataDF).write.mode(SaveMode.Overwrite).option("header", "false").csv(outputFolder)
      //use hadoop FileUtil to merge all partition csv files into a single file
      //(copyMerge exists in Hadoop 2.x; it was removed in Hadoop 3.0)
      val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
      FileUtil.copyMerge(fs, new Path(outputFolder), fs, new Path("/folder/target.csv"), true, sparkSession.sparkContext.hadoopConfiguration, null)