Question
I am developing an ETL pipeline for Google Cloud Dataflow where I have several branching ParDo
transforms which each require a local audio file. The branched results are then combined and exported as text.
This started as a Python script that ran on a single machine, which I am attempting to adapt for parallelisation across VM workers using GC Dataflow.
The extraction process downloads the files from a single GCS bucket location and then deletes them after the transform is completed, to keep storage under control. This is because the pre-processing module requires local access to the files. It could be re-engineered to handle a byte stream instead of a file by rewriting some of the pre-processing libraries myself - however, some attempts at this aren't going well, and I'd first like to explore how to handle parallelised local file operations in Apache Beam / GC Dataflow in order to understand the framework better.
In this rough implementation each branch downloads and deletes the files, with lots of double handling. In my implementation I have 8 branches, so each file is being downloaded and deleted 8 times. Could a GCS bucket instead be mounted on every worker rather than downloading files from the remote?
Or is there another way to ensure workers are being passed the correct reference to a file so that:

- a single DownloadFilesDoFn() can download a batch
- the local file references can then be fanned out in a PCollection to all the branches
- and a final CleanUpFilesDoFn() can remove them

How can you parallelise local file references? What is the best branched ParDo strategy for Apache Beam / GC Dataflow if local file operations cannot be avoided?
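Roughly, the pipeline shape I have in mind would look something like the sketch below. It is only a sketch: DownloadFilesDoFn() and CleanUpFilesDoFn() don't exist yet, I'm not sure local paths written by one stage would even be visible to branches running on other workers, and the cleanup step would somehow need to wait until every branch has finished with a file.

with beam.Pipeline(argv=pipeline_args) as p:

    local_files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
          # hypothetical: downloads each file once and adds a local path to the element
          | 'Download Batch' >> beam.ParDo(DownloadFilesDoFn())
    )

    # fan the local file references out to every branch
    preds1 = local_files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
    preds2 = local_files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))

    joined = (preds1, preds2) | 'Join Branches' >> beam.Flatten()

    # hypothetical: removes the local copies once the branches are done
    cleaned = joined | 'Clean Up' >> beam.ParDo(CleanUpFilesDoFn())

    cleaned | 'WriteToText' >> beam.io.WriteToText(known_args.output)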
Here is some example code of my existing implementation, with two branches for simplicity:
import subprocess

import apache_beam as beam


# singleton decorator
def singleton(cls):
    instances = {}
    def getinstance(*args, **kwargs):
        # forward constructor arguments so Predict(model) can still be
        # instantiated once per worker
        if cls not in instances:
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]
    return getinstance


@singleton
class Predict():
    def __init__(self, model):
        '''
        Process audio, reads in filename
        Returns Prediction
        '''
        self.model = model

    def process(self, filename):
        # simplified pseudocode
        audio = preprocess.load(filename=filename)
        prediction = inference(self.model, audio)
        return prediction
class PredictDoFn(beam.DoFn):
    def __init__(self, model):
        self.localfile, self.model = "", model

    def process(self, element):
        # Construct Predict() object singleton per worker
        predict = Predict(self.model)

        subprocess.run(['gsutil', 'cp', element['GCSPath'], './'], cwd=cwd, shell=False)
        self.localfile = cwd + "/" + element['GCSPath'].split('/')[-1]

        res = predict.process(self.localfile)
        return [{
            'Index': element['Index'],
            'Title': element['Title'],
            'File': element['GCSPath'],
            self.model + 'Prediction': res
        }]

    def finish_bundle(self):
        subprocess.run(['rm', self.localfile], cwd=cwd, shell=False)
# DoFn to split the csv into elements (the GCS bucket could maybe be read as a PCollection instead)
class Split(beam.DoFn):
    def process(self, element):
        Index, Title, GCSPath = element.split(",")
        GCSPath = 'gs://mybucket/' + GCSPath
        return [{
            'Index': int(Index),
            'Title': Title,
            'GCSPath': GCSPath
        }]
A simplified version of the pipeline:
with beam.Pipeline(argv=pipeline_args) as p:

    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
    )

    # prediction 1 branch
    preds1 = (
        files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
    )

    # prediction 2 branch
    preds2 = (
        files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))
    )

    # join branches into a single PCollection
    joined = (preds1, preds2) | 'Join Branches' >> beam.Flatten()

    # output to file
    output = (
        joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)
    )
Answer 1:
To avoid downloading the files repeatedly, the contents of the files can be put into the PCollection itself.
class DownloadFilesDoFn(beam.DoFn):
    def __init__(self):
        import re
        self.gcs_path_regex = re.compile(r'gs:\/\/([^\/]+)\/(.*)')

    def start_bundle(self):
        # create the GCS client on the worker rather than pickling it with the DoFn
        import google.cloud.storage
        self.gcs = google.cloud.storage.Client()

    def process(self, element):
        file_match = self.gcs_path_regex.match(element['GCSPath'])
        bucket = self.gcs.get_bucket(file_match.group(1))
        blob = bucket.get_blob(file_match.group(2))
        # attach the raw file bytes to the element so downstream branches
        # don't have to download the file again
        element['file_contents'] = blob.download_as_bytes()
        yield element
Then PredictDoFn becomes:
class PredictDoFn(beam.DoFn):
    def __init__(self, model):
        self.model = model

    def start_bundle(self):
        self.predict = Predict(self.model)

    def process(self, element):
        res = self.predict.process(element['file_contents'])
        return [{
            'Index': element['Index'],
            'Title': element['Title'],
            'File': element['GCSPath'],
            self.model + 'Prediction': res
        }]
and the pipeline:
with beam.Pipeline(argv=pipeline_args) as p:

    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
          | 'Read files' >> beam.ParDo(DownloadFilesDoFn())
    )

    # prediction 1 branch
    preds1 = (
        files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
    )

    # prediction 2 branch
    preds2 = (
        files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))
    )

    # join branches into a single PCollection
    joined = (preds1, preds2) | 'Join Branches' >> beam.Flatten()

    # output to file
    output = (
        joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)
    )
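If preprocess.load really can only read from a filename, one possible adaptation (a sketch only, assuming .wav input and the preprocess/inference helpers from the question) is to have Predict spill the bytes to a short-lived temporary file on the worker and delete it straight after the prediction:

import os
import tempfile

@singleton
class Predict():
    def __init__(self, model):
        self.model = model

    def process(self, file_contents):
        # write the bytes to a temp file so the path-based
        # pre-processing library can read it, then remove it
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
            f.write(file_contents)
            tmp_path = f.name
        try:
            audio = preprocess.load(filename=tmp_path)
            return inference(self.model, audio)
        finally:
            os.remove(tmp_path)

This keeps the download-once benefit of DownloadFilesDoFn while confining local file handling to each worker's own temporary storage.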
Source: https://stackoverflow.com/questions/65057273/how-to-handle-operations-on-local-files-over-multiple-pardo-transforms-in-apache