google-cloud-ml-engine

How do I convert a CloudML Alpha model to a SavedModel?

南笙酒味 submitted on 2019-12-19 10:49:08

Question: In the alpha release of CloudML's online prediction service, the format for exporting a model was:

    inputs = {"x": x, "y_bytes": y}
    g.add_to_collection("inputs", json.dumps(inputs))
    outputs = {"a": a, "b_bytes": b}
    g.add_to_collection("outputs", json.dumps(outputs))

I would like to convert this to a SavedModel without retraining my model. How can I do that?

Answer 1: We can convert this to a SavedModel by importing the old model, creating the Signatures, and re-exporting it. This code is untested, …
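The answer above is cut off; as a first step of that conversion, the alias-to-tensor-name maps can be recovered from the old collections, which stored them as JSON strings. A minimal sketch, with simulated collection contents (in a real graph you would import the old MetaGraph and read the single entry of each collection; the tensor names here are illustrative):

```python
import json

# The alpha format stored JSON-encoded dicts mapping alias names to tensor
# names in the "inputs" and "outputs" collections. Simulated stand-ins for
# what tf.get_default_graph().get_collection("inputs")[0] would return:
inputs_collection = json.dumps({"x": "x:0", "y_bytes": "y:0"})
outputs_collection = json.dumps({"a": "a:0", "b_bytes": "b:0"})

def recover_name_map(collection_entry):
    """Decode one JSON-encoded alias -> tensor-name map."""
    return json.loads(collection_entry)

inputs = recover_name_map(inputs_collection)
outputs = recover_name_map(outputs_collection)
print(inputs["x"])   # "x:0" -- the tensor to look up in the imported graph
```

With these maps in hand, the remaining work is looking each tensor up in the imported graph and passing the resulting dicts to the SavedModel signature builder.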

How do I change the Signatures of my SavedModel without retraining the model?

允我心安 submitted on 2019-12-19 04:16:55

Question: I just finished training my model, only to find out that the model I exported for serving had problems with its signatures. How do I update them? (One common problem is setting the wrong shape for CloudML Engine.)

Answer 1: Don't worry -- you don't need to retrain your model. That said, there is a little work to be done. You're going to create a new (corrected) serving graph, load the checkpoints into that graph, and then export this graph. For example, suppose you add a placeholder but didn't …
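Since the question calls out wrong shapes as a common signature problem, here is a small sketch of the batching convention that usually causes it: the online service sends each JSON instance as one element of a batch, so a serving input declared per-example (e.g. shape `(num_features,)`) needs a leading `None` dimension instead. The feature name, count, and helper below are illustrative, not part of the original question:

```python
import json

# Illustrative: matches a hypothetical serving placeholder declared as
# tf.placeholder(tf.float32, shape=(None, 3)) -- note the leading None,
# because the service batches instances along the first dimension.
NUM_FEATURES = 3

def make_request(instances):
    """Build the JSON body for a prediction request, checking each
    instance carries exactly NUM_FEATURES values."""
    for inst in instances:
        if len(inst["x"]) != NUM_FEATURES:
            raise ValueError("each instance must carry NUM_FEATURES values")
    return json.dumps({"instances": instances})

# Two instances -> one batch of shape (2, 3) on the server side.
body = make_request([{"x": [1.0, 2.0, 3.0]},
                     {"x": [4.0, 5.0, 6.0]}])
```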

tensorflow serving prediction not working with object detection pets example

两盒软妹~` submitted on 2019-12-18 07:36:21

Question: I was trying to run predictions on gcloud ml-engine with the TensorFlow object detection pets example, but it doesn't work. I created a checkpoint using this example: https://github.com/tensorflow/models/blob/master/object_detection/g3doc/running_pets.md With the help of the tensorflow team, I was able to create a saved_model to upload to the gcloud ml-engine: https://github.com/tensorflow/models/issues/1811 Now I can upload the model to the gcloud ml-engine, but unfortunately I'm not able …

Distributed Training with tf.estimator resulting in more training steps

青春壹個敷衍的年華 submitted on 2019-12-18 03:49:28

Question: I am experimenting with distributed training options on Cloud ML Engine and I am observing some peculiar results. I basically altered the census custom estimator example to contain a slightly different model, and switched the optimizer to AdamOptimizer as the only real changes. Based on this other thread, my understanding is that any distributed training should be data-parallel asynchronous training, which would suggest: "If you distribute 10,000 batches among 10 worker nodes, each node …
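A back-of-the-envelope sketch of why extra steps can appear: in asynchronous data-parallel training, every worker runs its own step loop against the shared parameters, so if the step budget is enforced per worker rather than on the shared global step, the job performs roughly workers × train_steps updates. This is one plausible reading of the observation above, not a confirmed description of Cloud ML Engine's behavior:

```python
# Rough step accounting for async data-parallel training.
def total_updates(workers, train_steps, per_worker_limit=True):
    """Approximate global updates performed by a distributed job.

    per_worker_limit=True models each worker counting its own steps;
    False models a limit applied to the shared global step.
    """
    return workers * train_steps if per_worker_limit else train_steps

print(total_updates(10, 1000))  # 10000 -- ten workers each run 1000 steps
```

If the observed step count is close to the worker count times the requested steps, the per-worker interpretation is the likely explanation.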

Using a nightly TensorFlow build for training with Cloud ML Engine

痴心易碎 submitted on 2019-12-12 15:36:21

Question: If I need to use a nightly TensorFlow build in a Cloud ML Engine training job, how do I do it?

Answer 1: Download a nightly build from https://github.com/tensorflow/tensorflow#installation. How to pick the right build: use "Linux CPU-only" or "Linux GPU" depending on whether you need GPUs for training, and use the Python 2 build. Rename the .whl file, for example:

    mv tensorflow-1.0.1-cp27-cp27mu-linux_x86_64.whl \
       tensorflow-1.0.1-cp27-none-linux_x86_64.whl

(here we renamed the cp27mu tag to none) …
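The rename in the answer swaps the wheel's ABI tag (`cp27mu`) for `none` so that pip inside the training environment accepts the file. A small helper performing the same rename programmatically; wheel filenames follow `{dist}-{version}-{python}-{abi}-{platform}.whl`, so splitting from the right is safe even if the distribution name contains hyphens:

```python
import os

def relax_abi_tag(wheel_name):
    """Rewrite a wheel filename so its ABI tag becomes 'none'."""
    base, ext = os.path.splitext(wheel_name)
    # Split off the last three tags: python tag, ABI tag, platform tag.
    head, py_tag, abi_tag, platform_tag = base.rsplit("-", 3)
    return "-".join([head, py_tag, "none", platform_tag]) + ext

print(relax_abi_tag("tensorflow-1.0.1-cp27-cp27mu-linux_x86_64.whl"))
# tensorflow-1.0.1-cp27-none-linux_x86_64.whl
```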

Placeholder tensors require a value in ml-engine predict but not local predict

半世苍凉 submitted on 2019-12-12 15:19:19

Question: I've been developing a model for use with the Cloud ML Engine's online prediction service. My model contains a placeholder_with_default tensor that I use to hold a threshold for prediction significance:

    threshold = tf.placeholder_with_default(0.01, shape=(), name="threshold")

I've noticed that when using local predict:

    gcloud ml-engine local predict --json-instances=data.json --model-dir=/my/model/dir

I don't need to supply values for this tensor. For example, this is a valid input:

    {"features": ["a" …
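If the online service insists on a value for every declared input even when the graph supplies a default, one workaround sketch is to fill the default in client-side before sending the request. The field names follow the question, but the helper itself is hypothetical:

```python
import json

# Mirrors the graph default from the question:
# tf.placeholder_with_default(0.01, shape=(), name="threshold")
DEFAULT_THRESHOLD = 0.01

def with_threshold(instances, threshold=DEFAULT_THRESHOLD):
    """Return instances with the 'threshold' input filled in if absent,
    leaving explicitly supplied thresholds untouched."""
    return [dict(inst, threshold=inst.get("threshold", threshold))
            for inst in instances]

payload = json.dumps({"instances": with_threshold([{"features": ["a"]}])})
```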

Keras google cloudml sample: IndexError

徘徊边缘 submitted on 2019-12-12 05:58:53

Question: I'm trying the Keras cloudml sample (https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/keras) and I seem unable to run the cloud training. The local training, both with python and gcloud, seems to go well. I've looked for a solution on Stack Exchange and Google, and read https://cloud.google.com/ml-engine/docs/how-tos/troubleshooting, but I seem to be the only one with this problem (usually a strong indication that the fault is entirely mine!). In addition to the environment …

Why am I getting an “Error loading notebook” error when trying to set up Datalab and do image classification on Cloud ML Engine?

≡放荡痞女 submitted on 2019-12-12 04:57:46

Question: I am following the tutorial here: https://codelabs.developers.google.com/codelabs/cloud-ml-engine-image-classification/index.html?index=..%2F..%2Findex#0 It claims that it will allow me to do image classification on Google Cloud. I follow the instructions, but when I get to step 4, "Start a datalab notebook", it tells me to open the docs folder in Google Cloud Datalab and then open the file called Hello World.ipynb. When I open this file I get a really weird error that I …

Google Cloud ML scipy.misc.imread returning <PIL.JpegImagePlugin.JpegImageFile>

戏子无情 submitted on 2019-12-12 04:48:25

Question: I am running the following snippet:

    import tensorflow as tf
    import scipy.misc
    from tensorflow.python.lib.io import file_io

    file = file_io.FileIO('gs://BUCKET/data/celebA/000007.jpg', mode='r')
    img = scipy.misc.imread(file)

If I run that snippet in Cloud Console, I get back a proper array. But when the same snippet runs in Cloud ML, the img object is:

    <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=178x218 at 0x7F1F8F26DA10>

This Stack Overflow answer suggests that libjpeg was not …
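If `scipy.misc.imread` hands back a `PIL.JpegImagePlugin.JpegImageFile` instead of an array, one workaround to try is an explicit `numpy.asarray(img)` conversion, which works because PIL image objects expose the NumPy array interface. The sketch below demonstrates the mechanism with a stand-in object rather than a real JPEG (so it runs without PIL or a GCS bucket); the real fix would be `img = np.asarray(scipy.misc.imread(file))`:

```python
import numpy as np

class FakeImage(object):
    """Minimal stand-in exposing the NumPy array interface, as PIL
    image objects do."""
    def __init__(self, pixels):
        self._pixels = np.asarray(pixels, dtype=np.uint8)
        # np.asarray() looks for this attribute to read the pixel buffer.
        self.__array_interface__ = self._pixels.__array_interface__

img = FakeImage([[0, 255], [128, 64]])
arr = np.asarray(img)   # converts the image-like object to an ndarray
print(arr.shape)        # (2, 2)
```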

"Connection reset by peer" on adapted standard ML-Engine object-detection training

醉酒当歌 submitted on 2019-12-12 04:40:18

Question: My goal is to test custom object-detection training using the Google ML-Engine, based on the pet-training example from the Object Detection API. After some successful training cycles (maybe until the first checkpoint, since no checkpoint had been created) ...

    15:46:56.784 global step 2257: loss = 0.7767 (1.70 sec/step)
    15:46:56.821 global step 2258: loss = 1.3547 (1.13 sec/step)

... I received the following error on several object-detection training job trials:

    Error reported to Coordinator: , { …