Question
I've developed an Apache Beam pipeline locally where I run predictions on a sample file.
Locally on my computer I can load the model like this:
with open('gs://newbucket322/my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = cPickle.load(fid)
but when running on Google Dataflow that obviously doesn't work. I also tried changing the path to GS://, but that does not work either.
I also tried this code snippet (from here) that was used to load files:
class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
        yield (element, gcs.open(element).read())

model = (p
         | "Initialize" >> beam.Create(["gs://bucket/file.pkl"])
         | "Read blobs" >> beam.ParDo(ReadGcsBlobs())
        )
but that doesn't work for loading my model, or at least I cannot use this model variable to call the predict method.
This should be a pretty straightforward task, but I can't seem to find a straightforward answer to it.
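As far as I can tell, the snippet above only yields the raw bytes of the file inside a PCollection, so whatever it produces is not an object with a predict method. A minimal sketch of what I mean (assuming the standard pickle module is enough to deserialize the file):

class ReadGcsPickle(beam.DoFn):
    def process(self, element):
        import pickle
        from apache_beam.io.gcp import gcsio
        # Read the pickle bytes from GCS and turn them back into a Python object.
        gcs = gcsio.GcsIO()
        yield pickle.loads(gcs.open(element).read())

Even then, the unpickled model lives inside a PCollection rather than as a plain variable I can call predict on.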
Answer 1:
You can define a ParDo as below:
import logging
import pickle

import apache_beam as beam
from google.cloud import storage


class PredictSklearn(beam.DoFn):
    """Run predictions with a pickled model downloaded from GCS."""

    def __init__(self, project=None, bucket_name=None, model_path=None, destination_name=None):
        self._model = None
        self._project = project
        self._bucket_name = bucket_name
        self._model_path = model_path
        self._destination_name = destination_name

    def download_blob(self, bucket_name=None, source_blob_name=None, destination_file_name=None):
        """Downloads a blob from the bucket to a local file."""
        storage_client = storage.Client(project=self._project)
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(source_blob_name)
        blob.download_to_filename(destination_file_name)

    # Load once or very few times: setup() runs when the DoFn instance is
    # initialized on a worker, not once per element.
    def setup(self):
        logging.info("Model initialization {}".format(self._model_path))
        self.download_blob(bucket_name=self._bucket_name,
                           source_blob_name=self._model_path,
                           destination_file_name=self._destination_name)
        # Unpickle the model
        with open(self._destination_name, 'rb') as fid:
            self._model = pickle.load(fid)

    def process(self, element):
        element["prediction"] = self._model.predict(element["data"])
        return [element]
Then you can invoke this ParDo in your pipeline as below:
model = (p
         | "Read Files" >> TextIO...
         | "Run Predictions" >> beam.ParDo(PredictSklearn(project=known_args.bucket_project_id,
                                                          bucket_name=known_args.bucket_name,
                                                          model_path=known_args.model_path,
                                                          destination_name=known_args.destination_name))
        )
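As a side note (not part of the original answer): if you would rather skip the local download step entirely, Beam's FileSystems abstraction understands gs:// paths on Dataflow workers, so the pickle can be streamed and deserialized directly in setup(). A minimal sketch under that assumption, with a hypothetical model path:

import pickle

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class PredictFromGcs(beam.DoFn):
    def __init__(self, model_path=None):
        self._model_path = model_path  # e.g. "gs://bucket/model.pkl" (hypothetical)
        self._model = None

    def setup(self):
        # Stream the pickle straight from GCS; no local copy is needed.
        with FileSystems.open(self._model_path) as f:
            self._model = pickle.load(f)

    def process(self, element):
        element["prediction"] = self._model.predict(element["data"])
        yield element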
Source: https://stackoverflow.com/questions/58836947/how-to-load-my-pickeled-ml-model-from-gcs-to-dataflow-apache-beam