Troubleshooting Apache Beam pipeline import errors [BoundedSource objects is larger than the allowable limit]


Question


I have a bunch of text files (~1M) stored on Google Cloud Storage. When I read these files into a Google Cloud Dataflow pipeline for processing, I always get the following error:

Total size of the BoundedSource objects returned by BoundedSource.split() operation is larger than the allowable limit

The troubleshooting page says:

You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.

Does that mean I have to split my files into smaller batches, rather than import them all at once?

I'm using the Dataflow Python SDK to develop pipelines.
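For reference, here is a minimal sketch of the kind of read that hits the limit (the gs:// path and pipeline options are placeholders, not my actual job configuration):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # runner, project, temp_location etc. set elsewhere

with beam.Pipeline(options=options) as p:
    lines = (
        p
        # A single glob that expands to ~1M files; the split of this one
        # source is where the BoundedSource size limit is exceeded.
        | 'ReadTexts' >> beam.io.ReadFromText('gs://my-bucket/texts/*.txt')
    )
```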


Answer 1:


Splitting your files into batches is a reasonable workaround - e.g. read them using multiple ReadFromText transforms, or using multiple pipelines. I think at the level of 1M files, the first approach will not work. It's better to use a new feature:

The best way to read a very large number of files is using ReadAllFromText. It does not have scalability limitations (though it will perform worse if your number of files is very small).

It will be available in Beam 2.2.0, but it is already available at HEAD if you're willing to use a snapshot build.
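A minimal sketch of that approach with the Python SDK (the bucket path is again a placeholder); ReadAllFromText consumes a PCollection of file patterns or file names rather than a single glob:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # runner, project, temp_location etc. set elsewhere

with beam.Pipeline(options=options) as p:
    lines = (
        p
        # The patterns (or individual file names) become a PCollection, so
        # file expansion and splitting happen inside the pipeline instead of
        # in one BoundedSource.split() call on a single huge source.
        | 'Patterns' >> beam.Create(['gs://my-bucket/texts/*.txt'])
        | 'ReadAll' >> beam.io.ReadAllFromText()
    )
```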

See also "How can I improve performance of TextIO or AvroIO when reading a very large number of files?" for a Java version.



Source: https://stackoverflow.com/questions/45935690/troubleshooting-apache-beam-pipeline-import-errors-boundedsource-objects-is-lar
