I have a bunch of text files (~1M) stored on Google Cloud Storage. When I read these files into a Google Cloud Dataflow pipeline for processing, I always get the following error:
Splitting your files into batches is a reasonable workaround, e.g. reading them with multiple ReadFromText transforms or with multiple pipelines (a sketch of the multi-transform variant follows below). However, at the scale of ~1M files I don't think the first approach will work; it's better to use a new feature.
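For illustration, a minimal sketch of the multi-transform workaround, assuming your files can be grouped under per-batch patterns (the bucket name and batch layout here are hypothetical):

    import apache_beam as beam
    from apache_beam.io import ReadFromText

    # Hypothetical per-batch file patterns; adjust to however your files are grouped.
    patterns = ['gs://my-bucket/texts/batch-%03d/*.txt' % i for i in range(10)]

    with beam.Pipeline() as p:
        # One ReadFromText transform per batch, each with a unique label.
        per_batch = [
            p | 'Read batch %d' % i >> ReadFromText(pattern)
            for i, pattern in enumerate(patterns)
        ]
        # Merge the per-batch PCollections and continue processing.
        lines = tuple(per_batch) | 'Merge batches' >> beam.Flatten()
        lines | 'Process' >> beam.Map(lambda line: line.strip())

This keeps each ReadFromText small, but the pipeline graph still grows with the number of batches, which is why it stops being practical at very large file counts.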
The best way to read a very large number of files is to use ReadAllFromText. It does not have scalability limitations (though it performs worse when the number of files is very small).
It will be available in Beam 2.2.0, but it is already available at HEAD if you're willing to use a snapshot build.
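Here is a minimal sketch of the ReadAllFromText approach (the bucket name and pattern are hypothetical). The file patterns are passed in as an ordinary PCollection, so the ~1M files are expanded and read as data rather than baked into the pipeline graph:

    import apache_beam as beam
    from apache_beam.io import ReadAllFromText

    with beam.Pipeline() as p:
        lines = (
            p
            # Patterns (or explicit file names) are elements of a PCollection,
            # so a huge number of files no longer bloats pipeline construction.
            | 'File patterns' >> beam.Create(['gs://my-bucket/texts/*.txt'])
            | 'Read all files' >> ReadAllFromText()
            | 'Process' >> beam.Map(lambda line: line.strip())
        )

You can also build the pattern PCollection dynamically, e.g. from an earlier step that lists or generates file names.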
See also "How can I improve performance of TextIO or AvroIO when reading a very large number of files?" for a Java version.