Reading from compressed files in Dataflow

半阙折子戏 2021-01-14 14:07

Is there a way (or any kind of hack) to read input data from compressed files? My input consists of a few hundred files, which are produced compressed with gzip and u

4 answers
  •  慢半拍i (original poster)
    2021-01-14 14:43

    Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read by specifying the compression type:

    TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)
    

    However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for a file. This even works with globs, where the files that result from the glob may be a mix of .gz, .bz2, and uncompressed.
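
    To make the extension-based behavior concrete, here is a rough standalone sketch of what choosing a decompressor from the file name could look like. This is not the SDK's code: the class CompressionSketch, the enum Kind, and the helpers detect, readLines, and gzip are all hypothetical names made up for illustration, and it uses only the JDK. The JDK has no bzip2 decoder, so the sketch only detects .bz2 and does not read it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressionSketch {
    /** Hypothetical enum for this sketch; not the SDK's TextIO.CompressionType. */
    public enum Kind { GZIP, BZIP2, UNCOMPRESSED }

    /** Picks a compression kind from the file extension alone. */
    public static Kind detect(String fileName) {
        if (fileName.endsWith(".gz")) return Kind.GZIP;
        if (fileName.endsWith(".bz2")) return Kind.BZIP2;
        return Kind.UNCOMPRESSED;
    }

    /** Reads all lines from the stream, decompressing first when the name ends in .gz. */
    public static List<String> readLines(String fileName, InputStream raw) throws IOException {
        InputStream in;
        switch (detect(fileName)) {
            case GZIP:
                in = new GZIPInputStream(raw);
                break;
            case BZIP2:
                // The JDK ships no bzip2 decoder; a third-party library would be needed here.
                throw new UnsupportedOperationException("bzip2 is not handled in this sketch");
            default:
                in = raw;
        }
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Test helper: gzips a string in memory so the demo needs no files on disk. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A mixed "glob result": one gzipped input and one plain input.
        byte[] compressed = gzip("first\nsecond\n");
        byte[] plain = "third\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readLines("part-0.gz", new ByteArrayInputStream(compressed)));
        System.out.println(readLines("part-1.txt", new ByteArrayInputStream(plain)));
    }
}
```

    The point is only that the decision is made per file from its name, which is why a glob matching a mix of .gz, .bz2, and uncompressed files can work without any extra configuration.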
