Question
I created a training job where I fetch my data from BigQuery, perform training, and deploy the model. I would like to start training automatically in these two cases:
- More than 1000 new rows added to the dataset
- On a schedule (e.g., once a week)
I checked GCP Cloud Scheduler, but it doesn't seem suitable for my case.
Answer 1:
Cloud Scheduler is the right tool to trigger your training on a schedule. I don't know what your blocker is!
For your first point, you can't: there is no trigger (on BigQuery or on another database) that sends an event after X new rows. For this, I recommend the following (see the sketch below):
- Schedule a job with Cloud Scheduler (for example, every 10 minutes)
- The job runs a query in BigQuery and checks the number of rows added since the last training job (the date of the last training job must be stored somewhere; I recommend another BigQuery table)
- If the number of new rows is > 1000, trigger your training job
- Else, exit the function
As you can see, it's not so easy and there are several caveats:
- When you deploy your model, you also have to write the date of the latest training
- You have to run this query against BigQuery repeatedly. Partition your table correctly to limit the cost
Does it make sense for you?
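For what it's worth, a minimal sketch of that polling function in Python (with google-cloud-bigquery in the function's requirements) could look like the following. The dataset/table names my_dataset.training_data and my_dataset.training_runs, the inserted_at/trained_at columns, and the submit_training_job helper are assumptions for illustration only:

from google.cloud import bigquery

ROW_THRESHOLD = 1000

def check_and_trigger(event, context):
    # Cloud Function entry point: compare the new row count against the threshold.
    client = bigquery.Client()

    # Read the timestamp of the last training run from the metadata table
    # (bootstrap that table once; if last_run is NULL, nothing will match).
    last_run = list(client.query(
        "SELECT MAX(trained_at) AS trained_at FROM `my_dataset.training_runs`"
    ).result())[0].trained_at

    # Count the rows added since the last training run.
    count_job = client.query(
        "SELECT COUNT(*) AS n FROM `my_dataset.training_data` "
        "WHERE inserted_at > @last_run",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("last_run", "TIMESTAMP", last_run)
            ]
        ),
    )
    new_rows = list(count_job.result())[0].n

    if new_rows > ROW_THRESHOLD:
        submit_training_job()  # hypothetical helper that starts the training job
    # else: do nothing and exit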
EDIT
The gcloud command is a "simple" wrapper around API calls. Try adding the --log-http param to your gcloud command to see which API is called and with which parameters.
Anyway, you can start a job by calling this API directly, and if you want an example, use the --log-http param of the gcloud SDK!
Answer 2:
For anyone looking for a solution to submit a training job on a schedule, here I am posting my solution after trying a few ways. I tried:
- Running through Cloud Composer using Airflow
- Starting the job using a start script
- Using cron with Cloud Scheduler, Pub/Sub and a Cloud Function
The easiest and most cost-effective way is using Cloud Scheduler and the AI Platform client library with a Cloud Function.
step 1 - create a Pub/Sub topic (example: start-training)
step 2 - create a cron job using Cloud Scheduler targeting the start-training topic
step 3 - create a Cloud Function with trigger type Cloud Pub/Sub, topic start-training, and entry point submit_job. This function submits a training job to AI Platform through the Python client library.
Now we have this beautiful DAG
Scheduler -> Pub/Sub -> Cloud Function -> AI-platform
The Cloud Function code goes like this:
main.py
import datetime
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

id = '<PROJECT ID>'
bucket_name = "<BUCKET NAME>"
project_id = 'projects/{}'.format(id)

def submit_job(event, context):
    # Build a unique job id per invocation (done inside the function so that
    # warm Cloud Function instances don't reuse the same id).
    job_name = "training_" + datetime.datetime.now().strftime("%y%m%d_%H%M%S")

    # Training job configuration for AI Platform Training.
    training_inputs = {
        'scaleTier': 'BASIC',
        'packageUris': [f"gs://{bucket_name}/package/trainer-0.1.tar.gz"],
        'pythonModule': 'trainer.task',
        'region': 'asia-northeast1',
        'jobDir': f"gs://{bucket_name}",
        'runtimeVersion': '2.2',
        'pythonVersion': '3.7',
    }
    job_spec = {"jobId": job_name, "trainingInput": training_inputs}

    # Submit the job through the AI Platform (ml v1) API.
    cloudml = discovery.build("ml", "v1", cache_discovery=False)
    request = cloudml.projects().jobs().create(body=job_spec, parent=project_id)
    response = request.execute()
    return response
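As an optional follow-up (not part of the original answer), the same ml v1 client can also be used to check the state of the job you just submitted, for example for logging from the Cloud Function. A small sketch:

from googleapiclient import discovery

def get_job_state(project, job_id):
    # Returns the AI Platform job state, e.g. QUEUED, PREPARING, RUNNING,
    # SUCCEEDED or FAILED.
    cloudml = discovery.build("ml", "v1", cache_discovery=False)
    name = f"projects/{project}/jobs/{job_id}"
    return cloudml.projects().jobs().get(name=name).execute().get("state")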
requirements.txt
google-api-python-client
oauth2client
Important
- Make sure to use the project ID, not the project name; otherwise it will give a permission error.
- If you get an "ImportError: file_cache is unavailable when using oauth2client ..." error, use cache_discovery=False in the build function; otherwise leave it out so the function uses the cache for performance reasons.
- Point packageUris to the correct GCS location of your source package. In this case, my package name is trainer, built and located in the package folder in the bucket, and the main module is task.
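For context, here is a minimal sketch of the packaging side, assuming a standard setuptools layout (a trainer/ directory containing __init__.py and task.py, which is not spelled out in the original answer). Running python setup.py sdist produces dist/trainer-0.1.tar.gz, which is then copied to gs://<BUCKET NAME>/package/ so that packageUris above points at it:

# setup.py (assumed layout):
#   trainer/
#     __init__.py
#     task.py        <- main module, referenced as trainer.task
from setuptools import find_packages, setup

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),
    install_requires=[],  # add any training-time dependencies here
)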
Source: https://stackoverflow.com/questions/62612079/how-to-start-ai-platform-jobs-automatically