We've been running a Python pipeline in Datalab that reads image files from a bucket in Google Cloud Storage (importing google.datalab.storage). Originally we were using Direct
If your only usage of pydatalab is to read from GCS, then I would suggest using Dataflow's gcsio. Code example:
import apache_beam as beam

def read_file(input_tuple):
    filepath = input_tuple[0]
    with beam.io.gcp.gcsio.GcsIO().open(filepath, 'r') as f:
        # Process the file contents here and yield results downstream.
        yield f.read()

# Full GCS paths; the second tuple element is unused in this example.
input_tuples = [("gs://bucket/file.jpg", "UNUSED_FILEPATH_2")]

p = beam.Pipeline(options=options)
all_files = (p | "Create file path tuples" >> beam.Create(input_tuples))
all_files = (all_files | "Read file" >> beam.FlatMap(read_file))
p.run()
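The snippet above assumes an `options` object is already defined. A minimal sketch of how it might be constructed for running on Dataflow (the project, region, and bucket values are placeholders, not from the original question):

from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical values -- replace with your own project, region, and bucket.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
)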
pydatalab is pretty heavy, since it is more of a data exploration library meant to be used with Datalab or Jupyter. Dataflow's gcsio, on the other hand, is natively supported in Beam pipelines.
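If you would rather not call GcsIO directly, recent Beam releases also ship file-based transforms in beam.io.fileio that match and read gs:// paths natively. A minimal sketch, assuming a hypothetical glob pattern (reuse the `options` object from above if running on Dataflow):

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline(options=options) as p:
    contents = (
        p
        | "Match files" >> fileio.MatchFiles("gs://bucket/images/*.jpg")  # assumed pattern
        | "Read matches" >> fileio.ReadMatches()
        | "Read bytes" >> beam.Map(lambda readable_file: readable_file.read())
    )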