Tensorflow on ML Engine: The replica master 0 exited with a non-zero status of 1

Posted by 馋奶兔 on 2020-02-23 04:07:13

Question


I launch a tensorflow task on ML Engine and after about 2 minutes I keep getting an error message "The replica master 0 exited with a non-zero status of 1."

(The task incidentally runs fine with ml-engine local.)

Question: Is there any place or log file where I can see further information on what happened?

The logs viewer just gives the following:

{
 insertId:  "ibal72g1rxhr63"  
 logName:  "projects/**-***-ml/logs/ml.googleapis.com%2Fcnn180322_170649"  
 receiveTimestamp:  "2018-03-22T17:08:38.344282172Z"  
 resource: {
  labels: {
   job_id:  "cnn180322_170649"    
   project_id:  "**-***-ml"    
   task_name:  "service"    
  }
  type:  "ml_job"   
 }
 severity:  "ERROR"  
 textPayload:  "The replica master 0 exited with a non-zero status of 1."  
 timestamp:  "2018-03-22T17:08:38.344282172Z"  
}

Thanks in advance for any pointers!


Answer 1:


The apparent lack of log files was caused by a missing permission to write to the logs.

Under IAM & admin, adding the Logs Writer role to the account cloud-ml-service@<project_id>.iam.gserviceaccount.com solved the problem and enabled the master and workers to write log messages to Stackdriver as expected.
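The same role can probably be granted from the command line instead of the console. The sketch below only assembles the gcloud arguments and does not run anything. The helper name logs_writer_binding_command and the project ID "my-project" are made up for illustration, and it assumes that roles/logging.logWriter is the role ID behind the "Logs Writer" label in the console.

```python
def logs_writer_binding_command(project_id):
    """Build the gcloud argument list that grants Logs Writer to the
    ML Engine service account of the given project.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    # Service account name taken from the answer above.
    member = "serviceAccount:cloud-ml-service@{}.iam.gserviceaccount.com".format(
        project_id
    )
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        "--member", member,
        # Assumption: role ID corresponding to "Logs Writer" in the console.
        "--role", "roles/logging.logWriter",
    ]


if __name__ == "__main__":
    # "my-project" is a placeholder; substitute your own project ID.
    print(" ".join(logs_writer_binding_command("my-project")))
```

Printing the command first, rather than executing it, leaves room to check the project ID and account name before changing any IAM policy.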

For a similar discussion and some additional information, see "Stackdriver logs not available for Cloud ML jobs since migration to V2".

Thanks to all for giving input!




Answer 2:


Stackdriver agents can monitor many metrics and give details about ML Engine. For more details, please refer here. As far as I know, normal event logging and Stackdriver agents are the only tools for monitoring ML jobs on GCP.
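Once log entries are being written, one way to narrow the logs viewer to a single job is an advanced filter on the fields visible in the entry from the question (resource type, job_id, severity, task_name). This is a minimal sketch: ml_job_log_filter is a hypothetical helper, not part of any Google library, and it only builds the filter text.

```python
def ml_job_log_filter(job_id, min_severity="ERROR", task_name=None):
    """Build a Stackdriver Logging filter string for one ML Engine job.

    Hypothetical helper; it uses only fields that appear in the log entry
    quoted in the question.
    """
    clauses = [
        'resource.type="ml_job"',
        'resource.labels.job_id="{}"'.format(job_id),
        "severity>={}".format(min_severity),
    ]
    if task_name is not None:
        # Optional: restrict to one task, e.g. the "service" task seen above.
        clauses.append('resource.labels.task_name="{}"'.format(task_name))
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Job ID copied from the question's log entry.
    print(ml_job_log_filter("cnn180322_170649"))
```

Lowering min_severity to "INFO" should also surface ordinary progress messages from the job, assuming the service account is allowed to write them.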

Please note that Python 2.7, which the TensorFlow job may be running under on ML Engine, tries implicit relative imports first, whereas Python 3 uses absolute imports by default. If you used Python 3.4 locally, that would explain why the task worked locally but not on Google Cloud. You can refer to this post to modify your import statement. If you add the line "from __future__ import absolute_import" at the top of your code, before the line "import tensorflow as tf", your code may work.
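A minimal sketch of what the top of a trainer module might look like with that line in place. The function default_import_style is a hypothetical diagnostic added for illustration. The TensorFlow import is left as a comment so the snippet runs without TensorFlow installed.

```python
from __future__ import absolute_import  # must come before every other import

import sys

# import tensorflow as tf  # in a real trainer this follows the __future__ line


def default_import_style(version_info=sys.version_info):
    """Describe how a bare `import name` resolves without the __future__ line."""
    if version_info[0] >= 3:
        return "absolute"
    return "implicit relative first, then absolute"


if __name__ == "__main__":
    # Logging this at startup shows which interpreter the job really runs under.
    print("Python {}.{}: bare imports are {}".format(
        sys.version_info[0], sys.version_info[1], default_import_style()))
```

Running the same module locally and on the cloud job and comparing the printed line is a cheap way to confirm or rule out a Python version mismatch.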



Source: https://stackoverflow.com/questions/49434874/tensorflow-on-ml-engine-the-replica-master-0-exited-with-a-non-zero-status-of-1
