Creating a cluster before sending a job to Dataproc programmatically

Question

I'm trying to schedule a PySpark job. Following the GCP documentation, I ended up deploying a small Python script to App Engine that does the following:

authenticates using a service account
submits a job to a cluster

The problem is that the cluster must be up and running, otherwise the job can't be submitted. But I don't want the cluster to be running all the time, especially since my job only needs to run once a month.

I wanted to add the cluster creation to my Python script, but the call is asynchronous (it makes an HTTP request), so my job is submitted after the cluster creation call returns but before the cluster is actually up and running.

How can I handle this?

I'd like something cleaner than just waiting a few minutes in my script!

Thanks

EDIT: Here's what my code looks like so far:

To launch the job:

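import webapp2
from google.appengine.api import taskqueue


class EnqueueTaskHandler(webapp2.RequestHandler):
    def get(self):
        # Hand the long-running work to the 'worker' service via a push queue.
        task = taskqueue.add(
            url='/run',
            target='worker')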

        self.response.write(
            'Task {} enqueued, ETA {}.'.format(task.name, task.eta))

app = webapp2.WSGIApplication([('/launch', EnqueueTaskHandler)], debug=True)

The job:

import time

import googleapiclient.discovery
import webapp2


class CronEventHandler(webapp2.RequestHandler):

    def create_cluster(self, dataproc, project, zone, region, cluster_name):
        zone_uri = 'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'.format(project, zone)
        cluster_data = {...}

        dataproc.projects().regions().clusters().create(
            projectId=project,
            region=region,
            body=cluster_data).execute()

    def wait_for_cluster(self, dataproc, project, region, clustername):
        print('Waiting for cluster to run...')
        while True:
            result = dataproc.projects().regions().clusters().get(
                projectId=project,
                region=region,
                clusterName=clustername).execute()
            # Handle exceptions
            state = result['status']['state']
            if state == 'RUNNING':
                return result
            if state == 'ERROR':
                # Without this check a failed creation would loop forever.
                raise RuntimeError('Cluster creation failed')
            time.sleep(60)

    def wait_for_job(self, dataproc, project, region, job_id):
        print('Waiting for job to finish...')
        while True:
            result = dataproc.projects().regions().jobs().get(
                projectId=project,
                region=region,
                jobId=job_id).execute()
            # Handle exceptions
            print(result['status']['state'])
            # Stop on any terminal job state; a CANCELLED job would otherwise loop forever.
            if result['status']['state'] in ('ERROR', 'DONE', 'CANCELLED'):
                return result
            else:
                time.sleep(60)

    def submit_job(self, dataproc, project, region, clusterName):
        job = {...}
        result = dataproc.projects().regions().jobs().submit(
            projectId=project,
            region=region,
            body=job).execute()
        return result['reference']['jobId']


    def post(self):
        dataproc = googleapiclient.discovery.build('dataproc', 'v1')

        project = '...'
        region = "..."
        zone = "..."
        clusterName = '...'

        self.create_cluster(dataproc, project, zone, region, clusterName)
        self.wait_for_cluster(dataproc, project, region, clusterName)
        job_id = self.submit_job(dataproc,project,region,clusterName)
        self.wait_for_job(dataproc,project,region,job_id)
        dataproc.projects().regions().clusters().delete(
            projectId=project,
            region=region,
            clusterName=clusterName).execute()
        self.response.write("JOB SENT")

app = webapp2.WSGIApplication([('/run', CronEventHandler)], debug=True)

Everything works until the deletion of the cluster. At that point I get "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded." Any ideas?

Answer 1:

In addition to general polling, through list or get requests on either the Cluster or the Operation returned by the CreateCluster request, for single-use clusters like this you can also consider the Dataproc Workflows API, and possibly its InstantiateInline interface if you don't want to use full-fledged workflow templates. With this API, a single request specifies the cluster settings along with the jobs to submit. The jobs run automatically as soon as the cluster is ready to take them, after which the cluster is deleted automatically.
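As a rough illustration of the inline workflow idea, here is a minimal sketch in the same discovery-client style as the question's code. It assumes the client exposes workflowTemplates().instantiateInline(parent=..., body=...) as described in the Dataproc v1 REST reference. The step name and the function names are made up for the example, and the cluster config and script URI are left to the caller.

```python
def build_inline_workflow(cluster_name, cluster_config, main_python_file_uri):
    """Return a WorkflowTemplate body: one managed cluster, one PySpark job."""
    return {
        'placement': {
            # A managed cluster is created for the workflow and deleted after it.
            'managedCluster': {
                'clusterName': cluster_name,
                'config': cluster_config,
            }
        },
        'jobs': [
            {
                'stepId': 'monthly-pyspark',  # hypothetical step name
                'pysparkJob': {'mainPythonFileUri': main_python_file_uri},
            }
        ],
    }


def run_inline_workflow(dataproc, project, region, template):
    """Start the workflow and return the long-running Operation resource."""
    parent = 'projects/{}/regions/{}'.format(project, region)
    return dataproc.projects().regions().workflowTemplates().instantiateInline(
        parent=parent, body=template).execute()
```

Here dataproc would be the object returned by googleapiclient.discovery.build('dataproc', 'v1'), as in the question. The call returns as soon as the workflow is accepted, so the request handler does not have to stay alive while the cluster is created, the job runs, and the cluster is torn down.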

Answer 2:

You can use the Google Cloud Dataproc API to create, delete and list clusters.

The list operation can be (repeatedly) performed after create and delete operations to confirm that they completed successfully, since it provides the ClusterStatus of the clusters in the results with the relevant State information:

UNKNOWN     The cluster state is unknown.
CREATING    The cluster is being created and set up. It is not ready for use.
RUNNING     The cluster is currently running and healthy. It is ready for use.
ERROR       The cluster encountered an error. It is not ready for use.
DELETING    The cluster is being deleted. It cannot be used.
UPDATING    The cluster is being updated. It continues to accept and process jobs.

To avoid plain waiting between the (repeated) list invocations (in general not a good thing to do on GAE), you can enqueue delayed tasks in a push task queue (with the relevant context information), which lets you perform such list operations at a later time. For example, in Python, see taskqueue.add():

countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.

eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.

If, at task execution time, the result indicates that the operation of interest is still in progress, simply enqueue another such delayed task. This is effectively polling, but without an actual wait/sleep.
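The re-enqueue pattern might look roughly like the sketch below. To keep it runnable outside App Engine, the enqueue step is passed in as a callable. poll_step, its parameters, the /poll URL and the attempt limit are all hypothetical names and values chosen for the example, not part of any library.

```python
def poll_step(get_state, on_ready, enqueue, attempt=0, max_attempts=30,
              delay_s=60):
    """Run one polling step; re-enqueue a delayed task instead of sleeping."""
    state = get_state()
    if state == 'RUNNING':
        on_ready()  # e.g. submit the PySpark job
        return 'ready'
    if state == 'ERROR' or attempt + 1 >= max_attempts:
        return 'gave_up'  # cluster failed, or we have polled long enough
    # Schedule the next check; the current request returns immediately.
    enqueue(params={'attempt': attempt + 1}, countdown=delay_s)
    return 'retry'


# Hypothetical wiring inside a webapp2 handler (needs the App Engine SDK):
#   from google.appengine.api import taskqueue
#   attempt = int(self.request.get('attempt', '0'))
#   poll_step(get_state, submit_job,
#             lambda **kw: taskqueue.add(url='/poll', target='worker', **kw),
#             attempt=attempt)
```

Because each task does one short check and then returns, no single request has to stay open for the whole cluster lifetime, which is the situation that produces the deadline error in the question.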

Source: https://stackoverflow.com/questions/49790130/creating-a-cluster-before-sending-a-job-to-dataproc-programmatically

Tags

google-app-engine

google-cloud-platform

google-cloud-dataproc