How to read blob (pickle) files from GCS in a Google Cloud DataFlow job?

感动是毒 2020-12-21 14:42

I am trying to run a Dataflow pipeline remotely that uses a pickle file. Locally, I can load the file with the code below.

with open (known_args.file_path         


        
2 Answers
  • 2020-12-21 15:15

    open() is Python's built-in function, and it does not understand Google Cloud Storage paths such as gs://... URLs. Use the Beam FileSystems API instead: it handles GCS paths as well as the other filesystems that Beam supports.
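As a rough illustration of that suggestion, here is a minimal sketch of loading a pickle through `FileSystems.open`. The function name `load_pickle` and its `opener` parameter are hypothetical, introduced only so the helper can be tried locally without Beam; they are not part of the Beam API.

```python
import pickle


def load_pickle(path, opener=None):
    """Load a pickled object from a path that Beam's FileSystems can open.

    `opener` is a hypothetical hook for local testing. When it is omitted,
    the Beam FileSystems API is used, which resolves gs:// paths as well
    as local ones.
    """
    if opener is None:
        # Imported lazily so this module still loads where Beam is absent.
        from apache_beam.io.filesystems import FileSystems
        opener = FileSystems.open
    with opener(path) as f:
        return pickle.load(f)
```

Inside the pipeline this could replace the `with open(...)` call, for example `load_pickle(known_args.file_path)`.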

  • 2020-12-21 15:19

    If you have pickle files in your GCS bucket, you can load them as blobs and process them further as in your code. Because each blob arrives as bytes rather than a file object, use pickle.loads() instead of pickle.load():

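    import apache_beam as beam


    class ReadGcsBlobs(beam.DoFn):
        """Reads each GCS path and emits a (path, raw bytes) pair."""

        def process(self, element, *args, **kwargs):
            # Imported here so the GCP dependency is resolved on the worker.
            from apache_beam.io.gcp import gcsio
            gcs = gcsio.GcsIO()
            yield (element, gcs.open(element).read())


    # usage example (p is an existing beam.Pipeline):
    files = (p
             | "Initialize" >> beam.Create(["gs://your-bucket-name/pickle_file_path.pickle"])
             | "Read blobs" >> beam.ParDo(ReadGcsBlobs())
            )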
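The (path, bytes) pairs yielded by `ReadGcsBlobs` still need to be unpickled. A minimal sketch of that step follows; the helper name `unpickle_blob` is hypothetical, and the code assumes the blobs come from a source you trust, since unpickling executes arbitrary code.

```python
import pickle


def unpickle_blob(path_and_bytes):
    """Turn a (path, raw_bytes) pair into (path, unpickled_object)."""
    path, data = path_and_bytes
    # The blob is already in memory as bytes, so pickle.loads() applies
    # rather than pickle.load(), which expects a file object.
    return path, pickle.loads(data)


# In the pipeline above, this could follow the "Read blobs" step:
#   objects = files | "Unpickle" >> beam.Map(unpickle_blob)
```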
    