How to save a DataFrame as compressed (gzipped) CSV?

Asked 2020-12-30 23:09

I use Spark 1.6.0 and Scala.

I want to save a DataFrame as compressed CSV format.

Here is what I have so far (assume I already have df and sc as SparkContext).

4 Answers
  • 2020-12-30 23:47

    To write the CSV file with a header and rename the resulting part file to .csv.gzip:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
    
    df.coalesce(1).write.format("com.databricks.spark.csv")
      .mode("overwrite")
      .option("header", "true")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .save(tempLocationFileName)
    
    copyRename(tempLocationFileName, finalLocationFileName)
    
    // Merge the part files under srcPath into a single file at dstPath.
    def copyRename(srcPath: String, dstPath: String): Unit = {
      val hadoopConfig = new Configuration()
      val hdfs = FileSystem.get(hadoopConfig)
      // the "true" argument deletes the source files once they are merged into the new output
      FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
    }
    

    If you don't need the header, set the option to false, and then you won't need the coalesce either; the write will be faster too.
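    For illustration, a minimal sketch of that header-less variant (assuming the same tempLocationFileName as above); copyRename can still merge the part files afterwards, since concatenated gzip streams form a valid gzip file:

    df.write.format("com.databricks.spark.csv")
      .mode("overwrite")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .save(tempLocationFileName) // each partition writes its own gzipped part file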

  • 2020-12-30 23:49

    With Spark 2.0+, this has become a bit simpler:

    df.write.csv("path", compression="gzip")
    

    You don't need the external Databricks CSV package anymore.

    The csv() writer supports a number of handy options. For example:

    • sep: To set the separator character.
    • quote: To set the quote character.
    • header: Whether to include a header line.

    There are also a number of other compression codecs you can use, in addition to gzip:

    • bzip2
    • lz4
    • snappy
    • deflate

    The full Spark docs for the csv() writer list all of these options (see the Python and Scala API references).
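
    For example, a minimal Scala sketch combining these options (the keyword-argument form above is the PySpark API; in Scala each option is set via option(), and "output/users_csv" is a hypothetical output directory):

    // Spark 2.0+: gzipped CSV with separator, quote, and header options
    df.write
      .option("sep", ";")        // field separator
      .option("quote", "\"")     // quote character
      .option("header", "true")  // write column names as the first line
      .option("compression", "gzip")
      .csv("output/users_csv")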

  • 2020-12-30 23:52

    This code works for Spark 2.1, where a dedicated .codec(...) setter is not available:

    // my_directory is a String holding the output path
    df.write
      .format("com.databricks.spark.csv")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .save(my_directory)
    

    For Spark 2.2, you can use the df.write.csv(..., compression="gzip") option described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec (the PySpark csv() writer takes compression, not codec, as a keyword argument).

  • 2020-12-30 23:57

    Spark 2.2+

    df.write.option("compression","gzip").csv("path")

    Spark 2.0

    df.write.csv("path", compression="gzip")

    Spark 1.6

    On the spark-csv GitHub page (https://github.com/databricks/spark-csv) one can read:

    codec: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of case-insensitive shorten names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified.

    In this case, the codec is passed as an option (DataFrameWriter has no .codec(...) method):

    df.write.format("com.databricks.spark.csv") \
      .option("codec", "gzip") \
      .save('my_directory/my_file.gzip')
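
    A Scala sketch of the same call, matching the question's Spark 1.6 setup (the short name "gzip" is resolved case-insensitively, per the docs quoted above):

    df.write
      .format("com.databricks.spark.csv")
      .option("codec", "gzip") // short codec name accepted by spark-csv
      .save("my_directory/my_file.gzip")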
