We've been running a Python pipeline in Datalab that reads image files from a bucket in Google Cloud Storage (importing google.datalab.storage). Originally we were using Direct
If your only usage of pydatalab is to read from GCS, then I would suggest using Dataflow's gcsio. Code example:
import apache_beam as beam

def read_file(input_tuple):
    filepath = input_tuple[0]
    with beam.io.gcp.gcsio.GcsIO().open(filepath, 'r') as f:
        # Process the file contents here; FlatMap expects an iterable,
        # so yield (or return a list of) whatever you produce.
        yield f.read()

# Full gs:// paths; only the first element of each tuple is used here.
input_tuples = [("gs://bucket/file.jpg", "UNUSED_FILEPATH_2")]
p = beam.Pipeline(options=options)
all_files = (p | "Create file path tuple" >> beam.Create(input_tuples))
all_files = (all_files | "Read file" >> beam.FlatMap(read_file))
p.run()
pydatalab is pretty heavy since it is more of a data exploration library meant to be used with Datalab or Jupyter. Dataflow's GCSIO, on the other hand, is natively supported in Dataflow pipelines.
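If you don't want to hard-code the file paths, one option is Beam's FileSystems API, which can expand a GCS glob into the matching object paths before you feed them into Create. A minimal sketch, assuming your images live under a pattern like gs://bucket/images/*.jpg (the bucket name and pattern are placeholders):

from apache_beam.io.filesystems import FileSystems

# Expand the glob into concrete gs:// paths (pattern is a placeholder).
match_results = FileSystems.match(['gs://bucket/images/*.jpg'])
filepaths = [metadata.path
             for result in match_results
             for metadata in result.metadata_list]

# Same shape of input as in the example above.
input_tuples = [(path, "UNUSED_FILEPATH_2") for path in filepaths]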
The most likely issue is that you need to have Dataflow install the datalab PyPI package.
Typically you would do this by listing "datalab" in the requirements.txt file you upload to Dataflow. See https://cloud.google.com/dataflow/pipelines/dependencies-python
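A minimal sketch of that setup, assuming the file is named requirements.txt and sits next to your pipeline script (the names are placeholders):

# requirements.txt
#   datalab

# In your pipeline code, point Dataflow at the requirements file via the
# standard --requirements_file pipeline option so workers install it.
# (Other required Dataflow options such as --project and --temp_location
# are omitted here.)
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--requirements_file=requirements.txt',
])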