Troubleshooting Apache Beam pipeline import errors [total size of BoundedSource objects is larger than the allowable limit]

梦毁少年i · 2021-01-19 05:53

I have a bunch of text files (~1M) stored on Google Cloud Storage. When I read these files into a Google Cloud Dataflow pipeline for processing, I always get the following error: the total size of the BoundedSource objects is larger than the allowable limit.
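For context, a minimal sketch of the kind of pipeline that triggers this (the bucket path and step name are hypothetical placeholders, not from the original question):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# A single ReadFromText over a pattern matching ~1M files. On Dataflow,
# the initial split then produces more BoundedSource metadata than the
# service accepts, yielding the error above.
with beam.Pipeline(options=PipelineOptions()) as p:
    lines = p | 'ReadTexts' >> beam.io.ReadFromText('gs://my-bucket/texts/*.txt')
```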

1 Answer

    野趣味 · 2021-01-19 06:21

    Splitting your files into batches is a reasonable workaround: for example, read them using multiple ReadFromText transforms (as in the sketch below), or use multiple pipelines. At the level of ~1M files, though, the first approach is unlikely to scale, and it is better to use a newer feature:
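    A minimal sketch of the batching workaround, assuming a hypothetical prefix scheme (batch-0 through batch-9) that partitions the files:

    ```python
    import apache_beam as beam

    with beam.Pipeline() as p:
        # One ReadFromText per batch pattern, so each source split
        # stays under the size limit on its own.
        per_batch = [
            p | f'Read{i}' >> beam.io.ReadFromText(
                f'gs://my-bucket/texts/batch-{i}/*.txt')
            for i in range(10)
        ]
        # Merge the per-batch PCollections into one for downstream steps.
        lines = per_batch | beam.Flatten()
    ```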

    The best way to read a very large number of files is ReadAllFromText, which takes a PCollection of file patterns as input (see the sketch below). It does not have scalability limitations, though it will perform worse than ReadFromText when the number of files is very small.
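    A minimal sketch of using ReadAllFromText, assuming Beam 2.2.0+ and a hypothetical bucket path:

    ```python
    import apache_beam as beam
    from apache_beam.io.textio import ReadAllFromText

    with beam.Pipeline() as p:
        lines = (
            p
            # ReadAllFromText consumes a PCollection of file patterns,
            # so matching and reading ~1M files is distributed across
            # workers instead of one oversized source split.
            | 'Patterns' >> beam.Create(['gs://my-bucket/texts/*.txt'])
            | 'ReadAll' >> ReadAllFromText()
        )
    ```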

    It will be available in Beam 2.2.0, but it is already available at HEAD if you're willing to use a snapshot build.

    See also "How can I improve performance of TextIO or AvroIO when reading a very large number of files?" for the Java equivalent.
