Question
I launch a TensorFlow task on ML Engine, and after about 2 minutes I keep getting the error message "The replica master 0 exited with a non-zero status of 1."
(The task incidentally runs fine with ml-engine local.)
Question: Is there any place or log file where I can see further information on what happened?
The logs viewer just gives the following:
{
insertId: "ibal72g1rxhr63"
logName: "projects/**-***-ml/logs/ml.googleapis.com%2Fcnn180322_170649"
receiveTimestamp: "2018-03-22T17:08:38.344282172Z"
resource: {
labels: {
job_id: "cnn180322_170649"
project_id: "**-***-ml"
task_name: "service"
}
type: "ml_job"
}
severity: "ERROR"
textPayload: "The replica master 0 exited with a non-zero status of 1."
timestamp: "2018-03-22T17:08:38.344282172Z"
}
Thanks in advance for any pointers!
Answer 1:
The log files appeared to be missing because the service account lacked permission to write to the logs.
Under IAM & admin, adding the Logs Writer role to the account cloud-ml-service@<project_id>.iam.gserviceaccount.com
solved the problem: the master and workers can now write log messages to Stackdriver as expected.
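For anyone who prefers the command line to the console, the same role grant can probably be done with gcloud. The snippet below is only a sketch that assembles the command. The helper name logs_writer_binding_command and the project ID "my-project" are made up for illustration, the service account pattern is copied from this answer, and roles/logging.logWriter is the role ID that corresponds to Logs Writer.

```python
def logs_writer_binding_command(project_id):
    """Build the gcloud command that grants Logs Writer to the Cloud ML service account.

    Hypothetical helper: it only assembles the argument list, it does not run gcloud.
    """
    # Service account name pattern taken from the answer above (an assumption
    # for any project other than the original poster's).
    member = "serviceAccount:cloud-ml-service@{}.iam.gserviceaccount.com".format(project_id)
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        "--member", member,
        "--role", "roles/logging.logWriter",  # role ID for "Logs Writer"
    ]


if __name__ == "__main__":
    # "my-project" is a placeholder project ID.
    print(" ".join(logs_writer_binding_command("my-project")))
```

The printed command can be pasted into a shell, or the list can be passed to subprocess.check_call if you would rather run it from Python.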
For a similar discussion and some additional information, see "Stackdriver logs not available for Cloud ML jobs since migration to V2".
Thanks to all for giving input!
Answer 2:
Stackdriver agents can monitor many metrics and give details about ML Engine. For more details, please refer here. As far as I know, normal event logging and Stackdriver agents are the only tools for monitoring ML jobs on GCP.
Please note that Python 2.7, which the job presumably runs under on ML Engine, tries implicit relative imports first, whereas Python 3 treats a plain import as absolute. It is possible that you used Python 3.4 locally, which would explain why the code worked locally but not on Google Cloud. You can refer to this post to modify your import statement. So, if you include the line "from __future__ import absolute_import"
at the top of your code, before the line "import tensorflow as tf", your code may work.
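As a rough illustration of where that line goes, the top of a training script might look like the sketch below. The function describe_interpreter is a made-up helper, not part of TensorFlow or ML Engine. It just reports the Python version so you can compare the local run with the cloud run, and the TensorFlow import is left commented out so the snippet runs on its own.

```python
# __future__ imports must come before any other import in the file.
from __future__ import absolute_import
from __future__ import print_function

import sys

# import tensorflow as tf  # the TensorFlow import goes here, after the __future__ lines


def describe_interpreter():
    """Return the running Python version as a short string (hypothetical helper)."""
    return "Python {}.{}".format(sys.version_info[0], sys.version_info[1])


if __name__ == "__main__":
    # Printing this at startup shows which interpreter the job actually uses.
    print(describe_interpreter())
```

If the version printed locally differs from the one printed in the job's logs, the import behaviour described above is a plausible suspect.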
Source: https://stackoverflow.com/questions/49434874/tensorflow-on-ml-engine-the-replica-master-0-exited-with-a-non-zero-status-of-1