问题
I'm trying to schedule a PySpark Job. I followed the GCP documentation and ended up deploying a little python script to App Engine which does the following :
- authenticate using a service account
- submit a job to a cluster
The problem is, I need the cluster to be up and running otherwise the job won't be sent (duh !) but I don't want the cluster to always be up and running, especially since my job needs to run once a month.
I wanted to add the creation of a cluster in my python script but the call is asynchronous (it makes an HTTP request) and thus my job is submitted after the cluster creation call but before the cluster is really up and running.
How could I do ?
I'd like something cleaner than just waiting for a few minutes in my script !
Thanks
EDIT : Here's what my code looks like so far :
To launch the job
class EnqueueTaskHandler(webapp2.RequestHandler):
def get(self):
task = taskqueue.add(
url='/run',
target='worker')
self.response.write(
'Task {} enqueued, ETA {}.'.format(task.name, task.eta))
app = webapp2.WSGIApplication([('/launch', EnqueueTaskHandler)], debug=True)
The job
class CronEventHandler(webapp2.RequestHandler):
def create_cluster(self, dataproc, project, zone, region, cluster_name):
zone_uri = 'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'.format(project, zone)
cluster_data = {...}
dataproc.projects().regions().clusters().create(
projectId=project,
region=region,
body=cluster_data).execute()
def wait_for_cluster(self, dataproc, project, region, clustername):
print('Waiting for cluster to run...')
while True:
result = dataproc.projects().regions().clusters().get(
projectId=project,
region=region,
clusterName=clustername).execute()
# Handle exceptions
if result['status']['state'] != 'RUNNING':
time.sleep(60)
else:
return result
def wait_for_job(self, dataproc, project, region, job_id):
print('Waiting for job to finish...')
while True:
result = dataproc.projects().regions().jobs().get(
projectId=project,
region=region,
jobId=job_id).execute()
# Handle exceptions
print(result['status']['state'])
if result['status']['state'] == 'ERROR' or result['status']['state'] == 'DONE':
return result
else:
time.sleep(60)
def submit_job(self, dataproc, project, region, clusterName):
job = {...}
result = dataproc.projects().regions().jobs().submit(projectId=project,region=region,body=job).execute()
return result['reference']['jobId']
def post(self):
dataproc = googleapiclient.discovery.build('dataproc', 'v1')
project = '...'
region = "..."
zone = "..."
clusterName = '...'
self.create_cluster(dataproc, project, zone, region, clusterName)
self.wait_for_cluster(dataproc, project, region, clusterName)
job_id = self.submit_job(dataproc,project,region,clusterName)
self.wait_for_job(dataproc,project,region,job_id)
dataproc.projects().regions().clusters().delete(projectId=project, region=region, clusterName=clusterName).execute()
self.response.write("JOB SENT")
app = webapp2.WSGIApplication([('/run', CronEventHandler)], debug=True)
Everything works until the deletion of the cluster. At this point I get a "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded." Any idea ?
回答1:
In addition to general polling either through list
or get
requests on the Cluster or the Operation returned with the CreateCluster request, for single-use clusters like this you can also consider using the Dataproc Workflows API and possibly its InstantiateInline interface if you don't want to use full-fledged workflow templates; in this API you use a single request to specify cluster settings along with jobs to submit, and the jobs will automatically run as soon as the cluster is ready to take it, after which the cluster will be deleted automatically.
回答2:
You can use the Google Cloud Dataproc API to create, delete and list clusters.
The list
operation can be (repeatedly) performed after create
and delete
operations to confirm that they completed successfully, since it provides the ClusterStatus of the clusters in the results with the relevant State information:
UNKNOWN The cluster state is unknown.
CREATING The cluster is being created and set up. It is not ready for use.
RUNNING The cluster is currently running and healthy. It is ready for use.
ERROR The cluster encountered an error. It is not ready for use.
DELETING The cluster is being deleted. It cannot be used.
UPDATING The cluster is being updated. It continues to accept and process jobs.
To prevent plain waiting between the (repeated) list
invocations (in general not a good thing to do on GAE) you can enqueue delayed tasks in a push task queue (with the relevant context information) allowing you to perform such list
operations at a later time. For example, in python, see taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.
eta -- A
datetime.datetime
that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.
If at the task execution time the result indicates the operation of interest is still in progress simply enqueue another such delayed task - effectively polling but without an actual wait/sleep.
来源:https://stackoverflow.com/questions/49790130/creating-a-cluster-before-sending-a-job-to-dataproc-programmatically