I use Spark 1.6.0 and Scala.
I want to save a DataFrame as compressed CSV format.
Here is what I have so far (assume I already have df).

To write the CSV file with headers and rename the part-000 file to .csv.gzip:
df.coalesce(1).write.format("com.databricks.spark.csv").mode("overwrite")
  .option("header", "true")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(tempLocationFileName)
copyRename(tempLocationFileName, finalLocationFileName)
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

def copyRename(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // the "true" argument deletes the source files once they are merged into the new output
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
}
If you don't need the header, set it to false, and then you won't need the coalesce either. The write will be faster too.
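For instance, a headerless, multi-part gzip write might look like the sketch below (assuming Spark 1.6 with the spark-csv package on the classpath; tempLocationFileName is a placeholder output directory):

```scala
// Sketch: no coalesce, so each partition writes its own gzipped part file in parallel.
df.write.format("com.databricks.spark.csv")
  .mode("overwrite")
  .option("header", "false")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(tempLocationFileName)
```

Skipping coalesce(1) avoids funnelling all data through a single task, which is what makes this faster for larger DataFrames.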
With Spark 2.0+, this has become a bit simpler:
df.write.csv("path", compression="gzip")
You don't need the external Databricks CSV package anymore.
The csv() writer supports a number of handy options. For example:

- sep: sets the separator character.
- quote: whether and how to quote values.
- header: whether to include a header line.

There are also a number of other compression codecs you can use, in addition to gzip:

- bzip2
- lz4
- snappy
- deflate
The full Spark docs for the csv() writer are here: Python / Scala
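Putting a few of those options together, a Spark 2.x write in Scala might look like this sketch (the output path and option values are placeholders):

```scala
// Sketch, assuming Spark 2.x: all options are set via option() before csv().
df.write
  .option("sep", ";")             // semicolon-separated values
  .option("quote", "\"")          // quote character for values containing the separator
  .option("header", "true")       // include a header line
  .option("compression", "bzip2") // any of the supported codecs works here
  .csv("out/path")
```

In Scala the codec is passed as the "compression" option rather than as a named argument, which is the Python-only form shown above.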
This code works for Spark 2.1, where .codec is not available.
df.write
.format("com.databricks.spark.csv")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.save(my_directory)
For Spark 2.2, you can use the df.write.csv(..., compression="gzip") option described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec
Spark 2.2+
df.write.option("compression","gzip").csv("path")
Spark 2.0
df.write.csv("path", compression="gzip")
Spark 1.6
On the spark-csv GitHub page (https://github.com/databricks/spark-csv) one can read:
codec
: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of case-insensitive shorten names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified.
In this case, this works:
df.write.format("com.databricks.spark.csv") \
    .option("codec", "gzip") \
    .save('my_directory/my_file.gzip')