Troubleshooting Apache Beam pipeline import errors [BoundedSource objects is larger than the allowable limit]


Question


I have a bunch of text files (~1M) stored on Google Cloud Storage. When I read these files into a Google Cloud Dataflow pipeline for processing, I always get the following error:

Total size of the BoundedSource objects returned by BoundedSource.split() operation is larger than the allowable limit

The troubleshooting page says:

You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.

Does that mean I have to split my files into smaller batches, rather than import them all at once?

I'm using the Dataflow Python SDK to develop pipelines.
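For reference, here is a minimal sketch of the kind of read that hits the limit (the gs:// path and pipeline options are placeholders, not my actual job configuration):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # runner, project, temp_location etc. set elsewhere

with beam.Pipeline(options=options) as p:
    lines = (
        p
        # A single glob that expands to ~1M files; the split of this one
        # source is where the BoundedSource size limit is exceeded.
        | 'ReadTexts' >> beam.io.ReadFromText('gs://my-bucket/texts/*.txt')
    )
```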


Answer 1:


Splitting your files into batches is a reasonable workaround - e.g. read them using multiple ReadFromText transforms, or using multiple pipelines. I think at the level of 1M files, the first approach will not work. It's better to use a new feature:

The best way to read a very large number of files is using ReadAllFromText. It does not have scalability limitations (though it will perform worse if your number of files is very small).

It will be available in Beam 2.2.0, but it is already available at HEAD if you're willing to use a snapshot build.
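A minimal sketch of that approach with the Python SDK (the bucket path is again a placeholder); ReadAllFromText consumes a PCollection of file patterns or file names rather than a single glob:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # runner, project, temp_location etc. set elsewhere

with beam.Pipeline(options=options) as p:
    lines = (
        p
        # The patterns (or individual file names) become a PCollection, so
        # file expansion and splitting happen inside the pipeline instead of
        # in one BoundedSource.split() call on a single huge source.
        | 'Patterns' >> beam.Create(['gs://my-bucket/texts/*.txt'])
        | 'ReadAll' >> beam.io.ReadAllFromText()
    )
```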

See also "How can I improve performance of TextIO or AvroIO when reading a very large number of files?" for a Java version.



Source: https://stackoverflow.com/questions/45935690/troubleshooting-apache-beam-pipeline-import-errors-boundedsource-objects-is-lar
