Troubleshooting Apache Beam pipeline import errors [BoundedSource objects is larger than the allowable limit]

I have a large number of text files (~1M) stored on Google Cloud Storage. When I read these files into a Google Cloud Dataflow pipeline for processing, I always get the following err

1 Answer

野趣味

2021-01-19 06:21

Splitting your files into batches is a reasonable workaround, e.g. reading them with multiple ReadFromText transforms or with multiple pipelines. At the scale of 1M files, though, I don't think the first approach will work. It's better to use a new feature:

The best way to read a very large number of files is using ReadAllFromText. It does not have scalability limitations (though it will perform worse if your number of files is very small).

It will be available in Beam 2.2.0, but it is already available at HEAD if you're willing to use a snapshot build.
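For illustration, here is a minimal sketch of what that could look like in the Python SDK, assuming Beam 2.2.0 or later is installed. ReadAllFromText consumes a PCollection of file patterns rather than a single pattern fixed at construction time. The bucket path, the sub-directory layout, and the helper names build_patterns and run are hypothetical and not from the question; the final count step is only there to give the pipeline something to do.

```python
def build_patterns(base, shards):
    """Return one glob per sub-directory (hypothetical layout, for illustration)."""
    return ["{}/{}/*.txt".format(base.rstrip("/"), s) for s in shards]


def run(patterns, argv=None):
    """Read every file matching the given patterns and print the total line count."""
    # Imported here so that build_patterns can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (
            p
            # The file patterns are ordinary pipeline data, not a source definition.
            | "Patterns" >> beam.Create(patterns)
            # Expands each pattern and reads the matching files line by line.
            | "ReadAll" >> beam.io.ReadAllFromText()
            | "CountLines" >> beam.combiners.Count.Globally()
            | "Print" >> beam.Map(print)
        )
```

Something like run(build_patterns("gs://my-bucket/logs", ["2017-01", "2017-02"])) would then start the pipeline; a single pattern such as ["gs://my-bucket/logs/*"] should work just as well, since the expansion happens inside the transform.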

See also "How can I improve performance of TextIO or AvroIO when reading a very large number of files?" for a Java version.
