I want to read gzip-compressed files into an RDD[String] using an equivalent of sc.textFile("path/to/file.Z").
Except my files have the extension .Z instead of .gz, so they are not recognized as gzip-compressed.
A workaround for this problem is described here: http://arjon.es/2015/10/02/reading-compressed-data-with-spark-using-unknown-file-extensions/
The relevant section:
...extend GzipCodec and override the getDefaultExtension method.
package smx.ananke.spark.util.codecs
import org.apache.hadoop.io.compress.GzipCodec
class TmpGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.tmp" // You should change it to ".Z"
}
Now we just register this codec by setting spark.hadoop.io.compression.codecs on the SparkConf:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
// Custom codec that treats files with the .gz.tmp extension as ordinary gzip
conf.set("spark.hadoop.io.compression.codecs", "smx.ananke.spark.util.codecs.TmpGzipCodec")
val sc = new SparkContext(conf)
val data = sc.textFile("s3n://my-data-bucket/2015/09/21/13/*")
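
For the .Z case from the question, a minimal sketch might look like the following. The names ZGzipCodec and ReadZFiles, the app name, the local master and the input path are placeholder assumptions of mine, not taken from the linked post. Before reading, it asks Hadoop's CompressionCodecFactory which codec it would pick for the path. As far as I know, that is the same suffix-based lookup the text input format relies on, so it works as a quick sanity check.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, GzipCodec}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical codec name: treats files ending in ".Z" as gzip data.
// Define it as a top-level class in the application so Hadoop can
// instantiate it by name through reflection.
class ZGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".Z"
}

object ReadZFiles {
  def main(args: Array[String]): Unit = {
    // Placeholder path, taken from the question.
    val inputPath = "path/to/file.Z"

    val conf = new SparkConf()
      .setAppName("read-z-files") // assumed app name
      .setMaster("local[*]")      // assumed: local run for testing
      // "spark.hadoop."-prefixed settings are passed on to the Hadoop configuration
      .set("spark.hadoop.io.compression.codecs", classOf[ZGzipCodec].getName)

    val sc = new SparkContext(conf)
    try {
      // Sanity check: which codec does Hadoop resolve for this file name?
      // getCodec returns null when no registered extension matches.
      val factory = new CompressionCodecFactory(sc.hadoopConfiguration)
      val codecName = Option(factory.getCodec(new Path(inputPath)))
        .map(_.getClass.getName)
        .getOrElse("no codec matched")
      println(s"Codec for $inputPath: $codecName")

      // Read the file as lines of text and show a few of them.
      val data = sc.textFile(inputPath)
      data.take(5).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Two things to keep in mind: the compiled codec class has to be on the classpath of the driver and the executors, and since gzip is not a splittable format, each file ends up in a single partition.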