How to use Google Cloud Storage in a Dataflow pipeline run from Datalab

心在旅途 2021-01-22 00:51

We've been running a Python pipeline in Datalab that reads image files from a bucket in Google Cloud Storage (importing google.datalab.storage). Originally we were using Direct

2 answers
  •  佛祖请我去吃肉
    2021-01-22 01:28

    If your only usage of pydatalab is to read from GCS, then I would suggest using Dataflow's gcsio. Code example:

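    import apache_beam as beam
    from apache_beam.io.gcp.gcsio import GcsIO
    from apache_beam.options.pipeline_options import PipelineOptions

    def read_file(input_tuple):
      # The first tuple element is the full gs:// path of the file.
      filepath = input_tuple[0]
      # GcsIO.open returns a binary file-like object, so f.read() gives bytes.
      with GcsIO().open(filepath, 'r') as f:
        # Process the content of f here; emitting the raw bytes is a placeholder.
        yield f.read()

    # Full gs:// paths (bucket name included); the second element is unused.
    input_tuples = [("gs://bucket/file.jpg", "UNUSED_FILEPATH_2")]
    # Assumption: default options; set the runner, project, etc. as needed.
    options = PipelineOptions()
    p = beam.Pipeline(options=options)
    all_files = (p | "Create file path tuple" >> beam.Create(input_tuples))
    all_files = (all_files | "Read file" >> beam.FlatMap(read_file))
    p.run()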
    

    pydatalab is pretty heavy, since it is more of a data exploration library meant to be used with Datalab or Jupyter. Dataflow's GcsIO, on the other hand, is natively supported in the pipeline.
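
If it helps, here is a rough sketch of one way to keep the per-file logic testable without touching GCS. The names `read_gcs_file`, `content_size` and the `opener` parameter are hypothetical and not part of any library; the only Beam call assumed is `GcsIO().open(path, 'r')`, as in the example above. The Beam import is done lazily inside the function, so the module can be imported (and the logic tested) on a machine without the GCP extras installed.

```python
import io


def content_size(file_obj):
    """Hypothetical processing step: return the size in bytes of a binary file object."""
    return len(file_obj.read())


def read_gcs_file(filepath, opener=None):
    """Open filepath with opener and return (filepath, size in bytes).

    opener is a hypothetical hook taking (path, mode) and returning a
    binary file-like object. When omitted, Beam's GcsIO is used.
    """
    if opener is None:
        # Imported lazily so Beam is only required when actually reading from GCS.
        from apache_beam.io.gcp.gcsio import GcsIO
        opener = GcsIO().open
    with opener(filepath, 'r') as f:
        return filepath, content_size(f)


def fake_opener(path, mode):
    """In-memory stand-in for GCS, used for local checks only."""
    return io.BytesIO(b"abc")


result = read_gcs_file("gs://bucket/file.jpg", opener=fake_opener)
```

In a pipeline this could be applied with something like `beam.Map(read_gcs_file)` over a collection of `gs://` paths, leaving `opener` at its default.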
