Reading from compressed files in Dataflow

半阙折子戏 2021-01-14 14:07

Is there a way (or any kind of hack) to read input data from compressed files? My input consists of a few hundred files, which are produced as compressed with gzip and u

4 answers
  •  囚心锁ツ
    2021-01-14 14:26

    The slower performance with my workaround was most likely because Dataflow was putting most of the files in the same split, so they weren't being processed in parallel. You can try the following to speed things up.

    • Create a PCollection for each file by applying the Create transform multiple times (each time to a single file).
    • Use the Flatten transform to merge the per-file PCollections into a single PCollection that contains all the files.
    • Apply your pipeline to this PCollection.
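
    The idea behind these steps (one independent unit of work per file, then a merge) can be sketched locally. Note that this is not the Dataflow SDK: it is a rough stdlib-only Python analogy, and the names read_gzip_lines and read_all, the max_workers value, and the UTF-8 text assumption are all hypothetical choices made for illustration.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def read_gzip_lines(path):
    """Decompress one gzip file and return its lines without trailing newlines."""
    # Assumption: the files contain UTF-8 text, one record per line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_all(paths, max_workers=4):
    """Read each file as its own task, then merge the results into one list."""
    # One task per file: a local stand-in for applying Create once per file,
    # so that no two files are forced into the same unit of work.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_file = list(pool.map(read_gzip_lines, paths))
    # Merge the per-file results: a local stand-in for Flatten.
    return list(chain.from_iterable(per_file))
```

    Calling read_all with a list of .gz paths returns every line from every file, in the order the paths were given. In a real pipeline the per-file collections would come from the Create transform and be merged with Flatten, as described in the steps above.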
