I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.
I have a Scala script that takes raw data from S3, processes it with Spark, and writes the result back out as CSV. The problem is that Spark produces a folder of part files, each with its own header, and I need a single CSV file with a single header line.
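For reference, a minimal sketch of the kind of job I mean (the bucket names, paths, and the transformation are placeholders, not my actual code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-pipeline").getOrCreate()

// Read raw data from S3 (placeholder bucket/prefix)
val raw = spark.read.option("header", "true").csv("s3a://my-bucket/raw/")

// ... transformations go here ...
val processed = raw // placeholder for the real processing

// Write the processed data back to S3 as CSV
processed.write.option("header", "true").csv("s3a://my-bucket/processed/")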
// Write the DataFrame as CSV text files
outputDataframe.write()
    .format("com.databricks.spark.csv")
    // header = true writes a header line into each part file
    .option("header", "true")
    .save(outputPath) // outputPath: destination directory (placeholder)
For an integration test showing how to write a single header, follow this link:
http://bytepadding.com/big-data/spark/write-a-csv-text-file-from-spark/
We had a similar issue and followed the approach below to get a single output file:
1. Write the dataframe to HDFS with headers, without using coalesce or repartition (after the transformations):
   dataframe.write.format("csv").option("header", "true").save(hdfs_path_for_multiple_files)
2. Read the files from step 1 back and write them out again with coalesce(1):
   dataframe = spark.read.option('header', 'true').csv(hdfs_path_for_multiple_files)
   dataframe.coalesce(1).write.format('csv').option('header', 'true').save(hdfs_path_for_single_file)
This way you avoid the performance issues of coalesce or repartition during the transformations (step 1), and step 2 produces a single output file with one header line.
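The second step as written looks like PySpark; in Scala the two steps together would look roughly like this (the paths and the dataframe/session names are placeholders):

// Step 1: write the transformed dataframe without coalesce/repartition (many part files, each with its own header)
dataframe.write.format("csv").option("header", "true").save("hdfs:///tmp/multiple_files")

// Step 2: read the part files back (Spark treats their headers as one logical header) and write a single part file
val single = spark.read.option("header", "true").csv("hdfs:///tmp/multiple_files")
single.coalesce(1).write.format("csv").option("header", "true").save("hdfs:///tmp/single_file")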
To merge files in a folder into one file:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // Concatenates all part files under srcPath into a single file at dstPath
  // (deleteSource = false keeps the original part files)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
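Usage would look something like this (paths are placeholders; FileSystem.get(hadoopConfig) returns the cluster's default file system, so for S3 you would need to obtain the FileSystem for the bucket's URI instead):

// Write the parts without headers so the merged file does not repeat them
dataFrame.write.format("com.databricks.spark.csv").option("header", "false").save("/tmp/csv-parts")
// Concatenate everything under /tmp/csv-parts into one file
merge("/tmp/csv-parts", "/data/output/result.csv")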
If you want all the data in one file, but still inside a Spark output folder (note that this pulls all the data into a single partition on one executor, which can be slow or run out of memory for large datasets):
dataFrame
  .coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save(out)
Another solution is to use the coalesce(1) approach above and then move the single part file inside the output folder to another path, giving it the name you want for your CSV file:
import java.io.File
import org.apache.spark.sql.DataFrame

// Works when the Spark output goes to the local file system (use the Hadoop FileSystem API otherwise)
def df2csv(df: DataFrame, fileName: String, sep: String = ",", header: Boolean = false): Unit = {
  val tmpDir = "tmpDir"
  df.repartition(1)
    .write
    .format("com.databricks.spark.csv")
    .option("header", header.toString)
    .option("delimiter", sep)
    .save(tmpDir)
  // Find the single part file Spark wrote and rename it to the requested file name
  val dir = new File(tmpDir)
  val tmpCsvFile = dir.listFiles.find(_.getName.startsWith("part-")).get
  tmpCsvFile.renameTo(new File(fileName))
  // Remove the temporary directory and its remaining bookkeeping files (_SUCCESS, .crc, ...)
  dir.listFiles.foreach(f => f.delete)
  dir.delete
}
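A call might then look like this (resultDF is a placeholder for whatever DataFrame you want to export; the separator and header flag are just example values):

// resultDF stands in for your own DataFrame
df2csv(resultDF, "result.csv", sep = ";", header = true)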
Try specifying a schema for the header and reading all the files from the folder with spark-csv's DROPMALFORMED mode. This should let you read every file in the folder while keeping only the header rows (because the data rows are dropped as malformed). Example:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val headerSchema = List(
  StructField("example1", StringType, true),
  StructField("example2", StringType, true),
  StructField("example3", StringType, true)
)

val header_DF = sqlCtx.read
  .option("delimiter", ",")
  .option("header", "false")
  .option("mode", "DROPMALFORMED")
  .option("inferSchema", "false")
  .schema(StructType(headerSchema))
  .format("com.databricks.spark.csv")
  .load("folder containing the files")
header_DF will contain only the header rows; from there you can transform the dataframe however you need.
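For example, if you only need the column names as plain strings, something like this would work (purely illustrative):

// Take the first header row and turn it into a Seq[String]
val headerNames = header_DF.head().toSeq.map(_.toString)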
You can work around it like this: build a one-row dataframe containing just the header names (headerDF), union it with the data, and write everything without Spark's own header. That way no partition contains a header except the single partition holding the header row from headerDF, and when all the partitions are merged there is one header at the top of the file. Sample code follows:
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.{Row, SaveMode}
import scala.collection.JavaConverters._

// dataFrame is the data to save on disk
// Cast all columns to String so the header row can be unioned with the data
val dataDF = dataFrame.select(dataFrame.columns.map(c => dataFrame.col(c).cast("string")): _*)

// Create a new dataframe containing only the header names
val headerDF = sparkSession.createDataFrame(List(Row.fromSeq(dataDF.columns.toSeq)).asJava, dataDF.schema)

// Prepend the header row to the data and write everything without Spark's own header
headerDF.union(dataDF).write.mode(SaveMode.Overwrite).option("header", "false").csv(outputFolder)

// Use Hadoop's FileUtil to merge all partition CSV files into a single file
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
FileUtil.copyMerge(fs, new Path(outputFolder), fs, new Path("/folder/target.csv"),
  true, sparkSession.sparkContext.hadoopConfiguration, null)
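To sanity-check the result you can read the merged file back and confirm there is exactly one header row (same target path as above):

val merged = sparkSession.read.option("header", "true").csv("/folder/target.csv")
merged.show(5)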