How to use Google Cloud Storage in a Dataflow pipeline run from Datalab

心在旅途 · 2021-01-22 00:51

We've been running a Python pipeline in Datalab that reads image files from a bucket in Google Cloud Storage (importing google.datalab.storage). Originally we were using Direct

2 Answers
  • 2021-01-22 01:28

    If your only usage of pydatalab is to read from GCS, then I would suggest using Dataflow's gcsio. Code example:

    import apache_beam as beam
    from apache_beam.io.gcp import gcsio  # makes beam.io.gcp.gcsio resolvable
    from apache_beam.options.pipeline_options import PipelineOptions

    def read_file(input_tuple):
      filepath = input_tuple[0]
      # GcsIO().open returns a file-like object for the gs:// path.
      with beam.io.gcp.gcsio.GcsIO().open(filepath, 'r') as f:
        data = f.read()  # process f content here
      # FlatMap expects an iterable, so yield the result.
      yield (filepath, data)

    # Full gs:// paths; the second element of each tuple is unused here.
    input_tuples = [("gs://bucket/file.jpg", "UNUSED_FILEPATH_2")]

    options = PipelineOptions()  # set runner/project options as needed
    p = beam.Pipeline(options=options)
    all_files = (p | "Create file path tuple" >> beam.Create(input_tuples))
    all_files = (all_files | "Read file" >> beam.FlatMap(read_file))
    p.run()


    pydatalab is pretty heavy, since it is more of a data exploration library meant to be used with Datalab or Jupyter. Dataflow's GcsIO, on the other hand, is supported natively inside a pipeline.

  • 2021-01-22 01:53

    The most likely issue is that you need to have Dataflow install the datalab PyPI package on its workers.

    Typically you would do this by listing "datalab" in the requirements.txt file you upload to Dataflow. See https://cloud.google.com/dataflow/pipelines/dependencies-python
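
    For example, here is a minimal sketch of how that might look (the project ID, bucket, and file names below are placeholders, not from the original question): put "datalab" on a line of its own in requirements.txt, then point the pipeline at that file with the --requirements_file option so the workers pip-install it.

    # requirements.txt, shipped next to the pipeline code, contains the single line:
    #   datalab

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project/bucket values; substitute your own.
    options = PipelineOptions([
      '--runner=DataflowRunner',
      '--project=my-project',
      '--temp_location=gs://my-bucket/tmp',
      '--requirements_file=requirements.txt',  # Dataflow installs datalab on each worker
    ])

    with beam.Pipeline(options=options) as p:
      # Trivial placeholder transform; replace with the real image-reading steps.
      _ = p | beam.Create(['gs://my-bucket/images/img.jpg']) | beam.Map(print)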
