How to start AI-Platform jobs automatically?

Submitted by 和自甴很熟 on 2021-01-01 09:38:26

Question


I created a training job where I fetch my data from BigQuery, perform training, and deploy the model. I would like to start training automatically in these two cases:

  1. More than 1000 new rows added to the dataset
  2. On a schedule (e.g., once a week)

I checked GCP Cloud Scheduler, but it seems it's not suitable for my case.


Answer 1:


Cloud Scheduler is the right tool to trigger your training on a schedule. I'm not sure what your blocker is!

For your first point, you can't. You can't put a trigger (on BigQuery or any other database) to send an event after X new rows. For this, I recommend the following:

  • Schedule a job with Cloud Scheduler (for example every 10 minutes)
  • The job runs a query in BigQuery and checks the number of rows added since the last training job (the date of the last training job must be stored somewhere; I recommend another BigQuery table)
    • If the number of rows is > 1000, trigger your training job
    • Else, exit the function

As you can see, it's not so easy and there are several caveats (a minimal sketch follows after this list):

  • When you deploy your model, you also have to write the date of the latest training somewhere
  • You have to run the query against BigQuery repeatedly. Partition your table correctly to limit the cost
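
To make that concrete, here is a minimal sketch of such a check-and-trigger function, assuming a hypothetical training_data table with an inserted_at timestamp column and a last_training metadata table that stores the date of the latest training (all of these names are placeholders, and the actual job submission is left as a comment):

from google.cloud import bigquery

client = bigquery.Client()

def check_and_train(event, context):
    # Count the rows added since the date of the last recorded training run
    query = """
        SELECT COUNT(*) AS n
        FROM `my_project.my_dataset.training_data`
        WHERE inserted_at > (SELECT MAX(trained_at)
                             FROM `my_project.my_dataset.last_training`)
    """
    new_rows = list(client.query(query).result())[0].n

    if new_rows <= 1000:
        return  # not enough new data yet, just exit

    # More than 1000 new rows: submit the training job here (for example with
    # the AI Platform API, as in the other answer), then record the training
    # date so that the next check starts from this point
    client.query(
        "INSERT `my_project.my_dataset.last_training` (trained_at) "
        "VALUES (CURRENT_TIMESTAMP())"
    ).result()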

Does it make sense for you?

EDIT

The gcloud command is a "simple" wrapper around API calls. Try adding the --log-http flag to your gcloud command to see which API is called and with which parameters.

Anyway, you can start a job by calling this API directly, and if you want an example of the request, use the --log-http flag of the gcloud SDK!




Answer 2:


For anyone looking for a solution to submit a training job on a schedule, here is my solution after trying a few approaches. I tried:

  • Running it through Cloud Composer using Airflow
  • Starting the job with a start script
  • Using a cron with Cloud Scheduler, Pub/Sub and a Cloud Function

The easiest and most cost-effective way is using Cloud Scheduler and the AI Platform client library with a Cloud Function.

Step 1 - create a Pub/Sub topic (for example, start-training)

Step 2 - create a cron job with Cloud Scheduler targeting the start-training topic

Step 3 - create a Cloud Function with trigger type Cloud Pub/Sub, topic start-training, and entry point submit_job. This function submits a training job to AI Platform through the Python client library.
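
If you prefer to script steps 1 and 2 rather than create them in the console, a rough sketch with the Pub/Sub and Cloud Scheduler Python client libraries could look like this (the project, location, schedule and job name below are placeholders I chose, not part of the original setup):

from google.cloud import pubsub_v1, scheduler_v1

project_id = "<PROJECT ID>"
location = "asia-northeast1"

# Step 1: create the Pub/Sub topic
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "start-training")
publisher.create_topic(request={"name": topic_path})

# Step 2: create the Cloud Scheduler cron targeting the topic (every Monday at 09:00)
scheduler = scheduler_v1.CloudSchedulerClient()
parent = f"projects/{project_id}/locations/{location}"
scheduler.create_job(
    request={
        "parent": parent,
        "job": {
            "name": f"{parent}/jobs/weekly-training",
            "schedule": "0 9 * * 1",
            "time_zone": "Asia/Tokyo",
            "pubsub_target": {"topic_name": topic_path, "data": b"start"},
        },
    }
)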

Now we have this beautiful DAG

Scheduler -> Pub/Sub -> Cloud Function -> AI-platform

The Cloud Function code goes like this:

main.py

import datetime

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

project = '<PROJECT ID>'  # use the project ID, not the project name
bucket_name = "<BUCKET NAME>"
project_id = 'projects/{}'.format(project)


def submit_job(event, context):
    # Build a unique job id per invocation (generating it at module level would
    # reuse the same id on warm instances and make later submissions fail)
    job_name = "training_" + datetime.datetime.now().strftime("%y%m%d_%H%M%S")

    training_inputs = {
        'scaleTier': 'BASIC',
        'packageUris': [f"gs://{bucket_name}/package/trainer-0.1.tar.gz"],
        'pythonModule': 'trainer.task',
        'region': 'asia-northeast1',
        'jobDir': f"gs://{bucket_name}",
        'runtimeVersion': '2.2',
        'pythonVersion': '3.7',
    }

    job_spec = {"jobId": job_name, "trainingInput": training_inputs}

    # cache_discovery=False avoids the file_cache ImportError mentioned below
    cloudml = discovery.build("ml", "v1", cache_discovery=False)
    request = cloudml.projects().jobs().create(body=job_spec, parent=project_id)
    response = request.execute()
    return response

requirements.txt

google-api-python-client
oauth2client

Important

  • Make sure to use the project ID, not the project name, otherwise you will get a permission error.

  • If you get an ImportError: file_cache is unavailable when using oauth2client ... error, use cache_discovery=False in the build function; otherwise, let the function use the cache for performance reasons.

  • Point to the correct GCS location of your source package; in this case my package is named trainer, built and located in the package folder in the bucket, and the main module is task.
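
To test the whole chain without waiting for the schedule, you can publish a message to the topic manually, for example with a small script like this (project and topic names as above):

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("<PROJECT ID>", "start-training")
# Publishing fires the Cloud Function immediately, just like the cron would
publisher.publish(topic_path, b"manual trigger").result()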



Source: https://stackoverflow.com/questions/62612079/how-to-start-ai-platform-jobs-automatically
