Question
I would like to stream from tar-gzip (tgz) files that contain my actual CSV data.
I already managed to get Structured Streaming working with Spark 2.2 when the data arrives as plain CSV files, but in reality the data arrives as gzipped CSV files.
Is there a way for the trigger fired by Structured Streaming to decompress the files before handling the CSV stream?
The code I use to process the files is this:
// derive the schema from the RawData case class
val schema = Encoders.product[RawData].schema

val trackerData = spark
  .readStream
  .option("delimiter", "\t")
  .schema(schema)
  .csv(path)

val exceptions = trackerData
  .as[String]
  .flatMap(extractExceptions)
  .as[ExceptionData]
This produced the expected output as long as path pointed to plain CSV files. But I would like to use tar-gzip files instead. When I place those files at the given path, I do not get any exceptions, and the batch output tells me:
"sources" : [ {
"description" : "FileStreamSource[file:/Users/matthias/spark/simple_spark/src/main/resources/zsessionlog*]",
"startOffset" : null,
"endOffset" : {
"logOffset" : 0
},
"numInputRows" : 1095,
"processedRowsPerSecond" : 211.0233185584891
} ],
But no actual data gets processed. The console sink looks like this:
+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+
Answer 1:
I do not think reading tar.gz'ed files is possible in Spark (see Read whole text files from a compression in Spark or gzip support in Spark for some ideas).
Spark does support gzip files, but they are not recommended because they are not splittable: each gzip file ends up as a single partition, which in turn makes Spark of little to no help.
In order to have gzipped files loaded in Spark Structured Streaming, you have to specify the path pattern so the files are included in loading, say zsessionlog*.csv.gz or alike. Otherwise, csv alone loads CSV files only.
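A minimal sketch of what that could look like, reusing the schema from the question (the exact glob and the gzData name are illustrative; they depend on how your files are actually named):

// assumes the gzipped CSVs are named zsessionlog*.csv.gz;
// Spark decompresses plain .gz transparently via the Hadoop codecs
val gzData = spark
  .readStream
  .option("delimiter", "\t")
  .schema(schema)
  .csv("src/main/resources/zsessionlog*.csv.gz")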
If you insist on using Spark Structured Streaming to handle tar.gz'ed files, you could write a custom streaming data Source to do the un-tar.gz'ing.
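For reference, a bare skeleton of what such a Source could look like in Spark 2.2. Note that org.apache.spark.sql.execution.streaming.Source is an internal API that may change between versions; the class name and the method bodies below are placeholders, not a working implementation:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.StructType

// Hypothetical skeleton: a Source that would list new tar.gz files,
// un-tar.gz them, and hand their rows to Spark batch by batch.
class TarGzStreamSource(sqlContext: SQLContext,
                        path: String,
                        override val schema: StructType) extends Source {

  // highest offset currently available, e.g. derived from the files seen so far
  override def getOffset: Option[Offset] = ???

  // decompress and parse the files that arrived between `start` and `end`
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???

  override def stop(): Unit = ()
}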
Given gzip files are not recommended as data format in Spark, the whole idea of using Spark Structured Streaming does not make much sense.
Answer 2:
I solved the part of reading .tar.gz (.tgz) files this way: inspired by this site, I created my own TGZ codec:
import java.io.{InputStream, OutputStream}

import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.hadoop.io.compress._

final class DecompressTgzCodec extends CompressionCodec {
  override def getDefaultExtension: String = ".tgz"

  // this codec is read-only: compression is not supported
  override def createOutputStream(out: OutputStream): CompressionOutputStream = ???
  override def createOutputStream(out: OutputStream, compressor: Compressor): CompressionOutputStream = ???
  override def createCompressor(): Compressor = ???
  override def getCompressorType: Class[_ <: Compressor] = ???

  override def createInputStream(in: InputStream): CompressionInputStream = {
    // gunzip first, then walk the tar entries inside
    new TarDecompressorStream(new TarArchiveInputStream(new GzipCompressorInputStream(in)))
  }
  override def createInputStream(in: InputStream, decompressor: Decompressor): CompressionInputStream =
    createInputStream(in)
  override def createDecompressor(): Decompressor = null
  override def getDecompressorType: Class[_ <: Decompressor] = null

  final class TarDecompressorStream(in: TarArchiveInputStream) extends DecompressorStream(in) {
    def updateStream(): Unit = {
      // no data left in the current tar entry -> advance to the next one
      if (in.available() <= 0) {
        in.getNextTarEntry()
      }
    }

    override def read(): Int = {
      checkStream()
      updateStream()
      in.read()
    }

    override def read(b: Array[Byte], off: Int, len: Int): Int = {
      checkStream()
      updateStream()
      in.read(b, off, len)
    }

    override def resetState(): Unit = {}
  }
}
And registered it for use by Spark:
val conf = new SparkConf()
// make Hadoop pick up the codec for files ending in .tgz
conf.set("spark.hadoop.io.compression.codecs", classOf[DecompressTgzCodec].getName)

val spark = SparkSession
  .builder()
  .master("local[*]")
  .config(conf)
  .appName("Streaming Example")
  .getOrCreate()
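With the codec registered, the streaming query from the question can point directly at the .tgz files, since Hadoop selects the codec by file extension. A sketch, assuming the same schema and a path layout like the one in the question:

// assumes the .tgz archives contain the tab-delimited CSV files
val trackerData = spark
  .readStream
  .option("delimiter", "\t")
  .schema(schema)
  .csv("src/main/resources/zsessionlog*.tgz")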
This works exactly as I wanted it to.
Source: https://stackoverflow.com/questions/48034069/how-to-load-tar-gz-files-in-streaming-datasets