How to load tar.gz files in streaming datasets?

Posted by 懵懂的女人 on 2021-01-01 03:50:55

Question


I would like to stream from tar-gzip files (.tgz) that contain my actual CSV data.

I have already managed to do structured streaming with Spark 2.2 when my data comes in as CSV files, but the data actually arrives as gzipped CSV files.

Is there a way for the trigger fired by structured streaming to decompress the files before handling the CSV stream?

The code I use to process the files is this:

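import org.apache.spark.sql.Encoders
import spark.implicits._ // encoders needed by .as[String] and .as[ExceptionData]

// Derive the schema from the RawData case class
val schema = Encoders.product[RawData].schema
// Stream tab-separated files from `path`
val trackerData = spark
  .readStream
  .option("delimiter", "\t")
  .schema(schema)
  .csv(path)
val exceptions = trackerData
  .as[String]
  .flatMap(extractExceptions)
  .as[ExceptionData]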

This produces output as expected when path points to CSV files, but I would like to use tar-gzip files. When I place those files at the given path, I do not get any exceptions, and the batch output tells me:

  "sources" : [ {
    "description" : "FileStreamSource[file:/Users/matthias/spark/simple_spark/src/main/resources/zsessionlog*]",
    "startOffset" : null,
    "endOffset" : {
      "logOffset" : 0
    },
    "numInputRows" : 1095,
    "processedRowsPerSecond" : 211.0233185584891
  } ],

However, no actual data gets processed. The console sink looks like this:

+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+

Answer 1:


I do not think reading tar.gz'ed files is possible in Spark out of the box (see "Read whole text files from a compression in Spark" or "gzip support in Spark" for some ideas).

Spark does support gzip files, but they are not recommended because they are not splittable: each gzip file is read as a single partition, which leaves Spark little to no room to parallelize the work.

To have gzipped files loaded in Spark Structured Streaming, you have to specify a path pattern that includes them, say zsessionlog*.csv.gz or the like. Otherwise, a pattern ending in csv alone loads plain CSV files only.
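As a rough illustration of the path-pattern approach, the sketch below builds a streaming DataFrame over gzipped, tab-separated files. The two-column schema (`id`, `payload`), the object and method names, and the directory argument are assumptions made for the example; they are not taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object GzipCsvStream {
  // Hypothetical schema; in the question it would come from
  // Encoders.product[RawData].schema instead.
  val schema: StructType = StructType(Seq(
    StructField("id", StringType),
    StructField("payload", StringType)
  ))

  // Builds a streaming DataFrame over gzipped, tab-separated files.
  // The glob is assumed to match names such as zsessionlog-001.csv.gz.
  def gzippedCsvStream(spark: SparkSession, dir: String): DataFrame =
    spark.readStream
      .option("delimiter", "\t")
      .schema(schema)
      .csv(s"$dir/zsessionlog*.csv.gz")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Gzip CSV Stream")
      .getOrCreate()

    // Print every micro-batch to the console until the query is stopped.
    gzippedCsvStream(spark, args(0))
      .writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

The decompression itself is left to the gzip codec that Hadoop selects from the `.gz` suffix, so no extra read option is set here.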

If you insist on having Spark Structured Streaming handle tar.gz'ed files, you could write a custom streaming data Source that does the un-tar.gz.
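A custom Source is a fair amount of work. A simpler alternative, which is not what this answer proposes, is to unpack each archive outside Spark and drop the extracted CSV files into the directory the stream watches. The following is a minimal sketch of that idea, assuming Apache Commons Compress is on the classpath; `TgzUnpacker`, `unpack`, and the staging/watched directory layout are hypothetical names chosen for the example.

```scala
import java.io.BufferedInputStream
import java.nio.file.{Files, Path, Paths, StandardCopyOption}
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream

object TgzUnpacker {
  // Extracts every regular file of a .tar.gz archive. Each file is first
  // written to stagingDir and then moved into watchedDir, so the stream
  // never sees a half-written file. Returns the final paths.
  def unpack(archive: Path, stagingDir: Path, watchedDir: Path): List[Path] = {
    Files.createDirectories(stagingDir)
    Files.createDirectories(watchedDir)
    val tar = new TarArchiveInputStream(
      new GzipCompressorInputStream(
        new BufferedInputStream(Files.newInputStream(archive))))
    try {
      Iterator
        .continually(tar.getNextTarEntry)
        .takeWhile(_ != null)
        .filter(_.isFile)
        .map { entry =>
          // Keep only the file name so an entry cannot escape the target directory.
          val name = Paths.get(entry.getName).getFileName.toString
          val staged = stagingDir.resolve(name)
          // Reading from `tar` stops at the end of the current entry.
          Files.copy(tar, staged, StandardCopyOption.REPLACE_EXISTING)
          Files.move(staged, watchedDir.resolve(name), StandardCopyOption.REPLACE_EXISTING)
        }
        .toList // force extraction before the stream is closed
    } finally {
      tar.close()
    }
  }
}
```

Entries with the same file name in different archive subdirectories would overwrite each other in this sketch; a real job would need a naming scheme that keeps them apart.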

Given that gzip files are not recommended as a data format in Spark, using Spark Structured Streaming for them does not make much sense.




Answer 2:


I solved the part of reading .tar.gz (.tgz) files this way: inspired by this site, I created my own TGZ codec:

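import java.io.{InputStream, OutputStream}
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.hadoop.io.compress.{CompressionCodec, CompressionInputStream, CompressionOutputStream, Compressor, Decompressor, DecompressorStream}

final class DecompressTgzCodec extends CompressionCodec {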
  override def getDefaultExtension: String = ".tgz"

  override def createOutputStream(out: OutputStream): CompressionOutputStream = ???
  override def createOutputStream(out: OutputStream, compressor: Compressor): CompressionOutputStream = ???
  override def createCompressor(): Compressor = ???
  override def getCompressorType: Class[_ <: Compressor] = ???

  override def createInputStream(in: InputStream): CompressionInputStream = {
    new TarDecompressorStream(new TarArchiveInputStream(new GzipCompressorInputStream(in)))
  }
  override def createInputStream(in: InputStream, decompressor: Decompressor): CompressionInputStream = createInputStream(in)

  override def createDecompressor(): Decompressor = null
  override def getDecompressorType: Class[_ <: Decompressor] = null

  final class TarDecompressorStream(in: TarArchiveInputStream) extends DecompressorStream(in) {
    def updateStream(): Unit = {
      // If the current tar entry still has data, keep reading from it
      if (in.available() <= 0) {
        // Current entry is exhausted: advance to the next tar entry so that
        // its content continues the same stream
        in.getNextTarEntry()
      }
    }

    override def read: Int = {
      checkStream()
      updateStream()
      in.read()
    }

    override def read(b: Array[Byte], off: Int, len: Int): Int = {
      checkStream()
      updateStream()
      in.read(b, off, len)
    }

    override def resetState(): Unit = {}
  }
}

I then registered it for use by Spark:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
conf.set("spark.hadoop.io.compression.codecs", classOf[DecompressTgzCodec].getName)

val spark = SparkSession
  .builder()
  .master("local[*]")
  .config(conf)
  .appName("Streaming Example")
  .getOrCreate()

This works exactly as I wanted.
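The registration presumably takes effect because Hadoop picks a codec by file name suffix, so a codec whose default extension is `.tgz` is applied to files ending in `.tgz`. The sketch below shows that lookup with Hadoop's `CompressionCodecFactory` and the stock codecs only; `CodecLookup` and `codecFor` are hypothetical names, and the custom codec is deliberately left out to keep the example self-contained.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

object CodecLookup {
  // Returns the simple class name of the codec Hadoop would choose for
  // the given file name, or None if no registered codec claims its suffix.
  def codecFor(fileName: String, conf: Configuration = new Configuration()): Option[String] =
    Option(new CompressionCodecFactory(conf).getCodec(new Path(fileName)))
      .map(_.getClass.getSimpleName)

  def main(args: Array[String]): Unit =
    Seq("data.csv", "data.csv.gz", "data.tar.gz", "data.tgz").foreach { name =>
      println(s"$name -> ${codecFor(name).getOrElse("no codec")}")
    }
}
```

One consequence of suffix matching is that a file named `*.tar.gz` is claimed by the regular gzip codec because it ends in `.gz`, so the archives would likely need the `.tgz` extension for a custom codec like the one above to be used.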



Source: https://stackoverflow.com/questions/48034069/how-to-load-tar-gz-files-in-streaming-datasets
