We've been running a Python pipeline in Datalab that reads image files from a bucket in Google Cloud Storage (importing google.datalab.storage). Originally we were using Direct
If your only usage of pydatalab is to read from GCS, then I would suggest using Dataflow's gcsio. Code example:
import apache_beam as beam

def read_file(input_tuple):
    filepath = input_tuple[0]
    with beam.io.gcp.gcsio.GcsIO().open(filepath, 'r') as f:
        # Process the file contents here and yield results downstream.
        yield f.read()

# Full GCS paths; the second tuple element is unused in this example.
input_tuples = [("gs://bucket/file.jpg", "UNUSED_FILEPATH_2")]

p = beam.Pipeline(options=options)
all_files = (p | "Create file path tuples" >> beam.Create(input_tuples))
all_files = (all_files | "Read file" >> beam.FlatMap(read_file))
p.run()
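The snippet above assumes an `options` object is already defined. A minimal sketch of how it might be constructed for running on Dataflow (the project, region, and bucket values are placeholders, not from the original question):

from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical values -- replace with your own project, region, and bucket.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
)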
pydatalab is pretty heavy, since it is more of a data exploration library meant to be used with Datalab or Jupyter. Dataflow's gcsio, on the other hand, is natively supported in Beam pipelines.
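If you would rather not call GcsIO directly, recent Beam releases also ship file-based transforms in beam.io.fileio that match and read gs:// paths natively. A minimal sketch, assuming a hypothetical glob pattern (reuse the `options` object from above if running on Dataflow):

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline(options=options) as p:
    contents = (
        p
        | "Match files" >> fileio.MatchFiles("gs://bucket/images/*.jpg")  # assumed pattern
        | "Read matches" >> fileio.ReadMatches()
        | "Read bytes" >> beam.Map(lambda readable_file: readable_file.read())
    )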