Distributed Tensorflow in Kubeflow - NotFoundError

Submitted by 蓝咒 on 2019-12-24 18:44:58

Question


I followed the tutorial for building Kubeflow on GCP.

At the last step, I deployed the code and started training on CPU:

kustomize build . | kubectl apply -f -

The distributed TensorFlow job then failed with this error:

tensorflow.python.framework.errors_impl.NotFoundError: /tmp/tmprIn1Il/model.ckpt-1_temp_a890dac1971040119aba4921dd5f631a; No such file or directory
[[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:ps/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, conv_layer1/conv2d/bias, conv_layer1/conv2d/kernel, conv_layer2/conv2d/bias, conv_layer2/conv2d/kernel, dense/bias, dense/kernel, dense_1/bias, dense_1/kernel, global_step)]]

I found a similar bug report but don't know how to resolve this.


Answer 1:


From the bug report:

You can work around this problem by using a shared filesystem (e.g. HDFS, GCS, or an NFS mount at the same mount point) on the workers and the parameter servers.

Just put the data on GCS and it works fine.

model.py

import tensorflow_datasets as tfds
import tensorflow as tf

# tfds works in both Eager and Graph modes.
# Note: tf.enable_eager_execution() is a TF 1.x API; eager mode is the default in TF 2.x.
tf.enable_eager_execution()

# See available datasets
print(tfds.list_builders())

# Load MNIST from the GCS bucket; batch_size=-1 returns each split as a single full batch
ds_train, ds_test = tfds.load(name="mnist", split=["train", "test"], data_dir="gs://kubeflow-tf-bucket", batch_size=-1)
ds_train = tfds.as_numpy(ds_train)
ds_test = tfds.as_numpy(ds_test)

(x_train, y_train) = ds_train['image'], ds_train['label']
(x_test, y_test) = ds_test['image'], ds_test['label']
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))
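The path in the error (/tmp/tmprIn1Il/...) looks like a temporary directory created on one pod, while the failing SaveV2 op is placed on /job:ps, where that directory presumably does not exist. The same workaround therefore applies to the checkpoint directory, not only to the data. Below is a minimal sketch, not the author's code, of a guard that rejects pod-local checkpoint paths before training starts. The helper names, the set of schemes, and the shared_mounts parameter are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumption: URI schemes that every worker and parameter server can reach.
SHARED_SCHEMES = {"gs", "hdfs"}


def is_shared_dir(path, shared_mounts=()):
    """Return True if `path` looks reachable from every task in the cluster.

    `shared_mounts` is a hypothetical list of local prefixes (for example an
    NFS mount) that are mounted at the same point on all pods.
    """
    if urlparse(path).scheme in SHARED_SCHEMES:
        return True
    # A local path only counts when it sits under a declared shared mount.
    return any(
        path == m.rstrip("/") or path.startswith(m.rstrip("/") + "/")
        for m in shared_mounts
    )


def checked_model_dir(path, shared_mounts=()):
    """Fail fast instead of hitting NotFoundError later on the ps task."""
    if not is_shared_dir(path, shared_mounts):
        raise ValueError(
            f"{path!r} is local to one pod; "
            "use a gs://, hdfs:// or shared-mount path"
        )
    return path
```

The returned path can then be passed wherever the training code sets its checkpoint location, for example tf.estimator.RunConfig(model_dir=checked_model_dir("gs://kubeflow-tf-bucket/model")), where the /model suffix is only an example.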


Source: https://stackoverflow.com/questions/56322632/distributed-tensorflow-in-kubeflow-notfounderror
