Question
While training my model on data larger than 20 GB on the BASIC tier in Cloud ML Engine, my jobs are failing because there is no disk space available on the Cloud ML machines, and I cannot find any disk-size details in the gcloud ML documentation [https://cloud.google.com/ml-engine/docs/tensorflow/machine-types].
I need help deciding which tier to use for my training jobs; the utilisation shown in the Job Details graphs is also very low.
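For reference, the scale tier can be set explicitly when submitting a training job via a config file passed with `--config`. A minimal sketch, assuming a `config.yaml` in the job directory; `CUSTOM` with `large_model` (a documented Cloud ML Engine machine type with more memory) is only an illustrative choice, not a recommendation for this specific workload:

```yaml
# config.yaml — hypothetical example of overriding the default BASIC tier
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
```

This would be submitted with `gcloud ml-engine jobs submit training ... --config config.yaml`.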
{
insertId: "1klpt2"
jsonPayload: {
created: 1554434546.3576794
levelname: "ERROR"
lineno: 51
message: "Failed to train : [Errno 28] No space left on device"
pathname: "/root/.local/lib/python3.5/site-packages/loggerwrapper.py"
}
labels: {
compute.googleapis.com/resource_id: ""
compute.googleapis.com/resource_name: "cmle-training-10361805218452604847"
compute.googleapis.com/zone: ""
ml.googleapis.com/job_id/log_area: "root"
ml.googleapis.com/trial_id: ""
}
logName: "projects/backend/logs/master-replica-0"
receiveTimestamp: "2019-03-31T12:32:30.07683Z"
resource: {
labels: {
job_id: ""
project_id: "backend"
task_name: "master-replica-0"
}
type: "ml_job"
}
severity: "ERROR"
timestamp: "2019-03-31T12:32:26.357679367Z"
}
Answer 1:
Solved: this error was caused not by storage space but by shared memory (tmpfs). The sklearn fit was consuming all of the shared memory during training. Solution: setting the JOBLIB_TEMP_FOLDER environment variable to /tmp solved the problem.
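The fix above can be sketched as follows. The key point is that JOBLIB_TEMP_FOLDER must be set before joblib spawns its workers, so it is safest to set it at the top of the training script; the RandomForestClassifier here is just an illustrative parallel sklearn fit, not the asker's actual model:

```python
import os

# Redirect joblib's memmapping temp folder from /dev/shm (a small tmpfs
# on many container images) to /tmp, which is backed by the boot disk.
os.environ["JOBLIB_TEMP_FOLDER"] = "/tmp"

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset standing in for the real training data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# n_jobs=-1 makes joblib parallelise the fit, which is what triggers
# the shared-memory memmapping in the first place.
clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

Setting the variable from the shell before launching the job (`export JOBLIB_TEMP_FOLDER=/tmp`) works equally well.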
Source: https://stackoverflow.com/questions/55452871/solved-no-space-left-on-device-in-google-cloudml-basic-tier-what-is-the-disk