Read a compressed file *with a custom extension* with Spark

深忆病人 2021-01-19 16:39

I want to read gzip-compressed files into an RDD[String] using an equivalent of sc.textFile("path/to/file.Z").

Except my file extension if

1 answer
  • 2021-01-19 17:37

    Here is a workaround for this problem: http://arjon.es/2015/10/02/reading-compressed-data-with-spark-using-unknown-file-extensions/

    The relevant section:

    ...extend GzipCodec and override the getDefaultExtension method.

    package smx.ananke.spark.util.codecs
    
    import org.apache.hadoop.io.compress.GzipCodec
    
    class TmpGzipCodec extends GzipCodec {
    
      override def getDefaultExtension(): String = ".gz.tmp" // You should change it to ".Z"
    
    }
    

    Now we just register this codec by setting spark.hadoop.io.compression.codecs on the SparkConf:

    import org.apache.spark.{SparkConf, SparkContext}
    
    val conf = new SparkConf()
    
    // Custom codec that processes files with the .gz.tmp extension as regular gzip
    conf.set("spark.hadoop.io.compression.codecs", "smx.ananke.spark.util.codecs.TmpGzipCodec")
    
    val sc = new SparkContext(conf)
    
    // Files whose names end with the codec's extension are decompressed on read
    val data = sc.textFile("s3n://my-data-bucket/2015/09/21/13/*")
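
    To check that a codec like this is actually picked up for a given file name, one option is to ask Hadoop's CompressionCodecFactory directly, without starting Spark. The sketch below is only an illustration under a few assumptions: ZGzipCodec and CodecCheck are hypothetical names, the codec sits in the default package, and hadoop-common is on the classpath. It uses the plain Hadoop key io.compression.codecs, which is where Spark forwards the spark.hadoop.-prefixed setting.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, GzipCodec}

// Hypothetical codec: treats files whose names end in ".Z" as gzip data.
class ZGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".Z"
}

object CodecCheck {
  def main(args: Array[String]): Unit = {
    val hadoopConf = new Configuration()
    // Register the codec by its fully qualified class name.
    hadoopConf.set("io.compression.codecs", classOf[ZGzipCodec].getName)

    val factory = new CompressionCodecFactory(hadoopConf)
    // getCodec matches on the file name suffix and returns null if nothing matches.
    val codec = factory.getCodec(new Path("path/to/file.Z"))
    println(Option(codec).map(_.getClass.getName).getOrElse("no codec matched"))
  }
}
```

    If the lookup reports no match, the usual suspects are a misspelled class name in the setting or a codec class that is missing from the classpath of the driver or executors.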
    