Question
I am trying to deploy a pretrained PyTorch model to AI Platform with a custom prediction routine. After following the instructions described here, the deployment fails with the following error:
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to have error, please contact Cloud ML.
The contents of the model folder total 83.89 MB, which is below the 250 MB limit described in the documentation. The only files in the folder are the checkpoint file (.pth) for the model and the tarball required for the custom prediction routine.
Command to create the model:
gcloud beta ai-platform versions create pose_pytorch --model pose --runtime-version 1.15 --python-version 3.5 --origin gs://rcg-models/pytorch_pose_estimation --package-uris gs://rcg-models/pytorch_pose_estimation/my_custom_code-0.1.tar.gz --prediction-class predictor.MyPredictor
Changing the runtime version to 1.14 leads to the same error.
I have tried changing the --machine-type argument to mls1-c4-m2, as Parth suggested, but I still get the same error.
The setup.py file that generates my_custom_code-0.1.tar.gz looks like this:
from setuptools import setup

setup(
    name='my_custom_code',
    version='0.1',
    scripts=['predictor.py'],
    install_requires=["opencv-python", "torch"]
)
Relevant code snippet from the predictor:
class MyPredictor(object):
    def __init__(self, model):
        """Stores artifacts for prediction. Only initialized via `from_path`.
        """
        self._model = model
        self._client = storage.Client()

    @classmethod
    def from_path(cls, model_dir):
        """Creates an instance of MyPredictor using the given path.

        This loads artifacts that have been copied from your model directory in
        Cloud Storage. MyPredictor uses them during prediction.

        Args:
            model_dir: The local directory that contains the trained Keras
                model and the pickled preprocessor instance. These are copied
                from the Cloud Storage model directory you provide when you
                deploy a version resource.

        Returns:
            An instance of `MyPredictor`.
        """
        net = PoseEstimationWithMobileNet()
        checkpoint_path = os.path.join(model_dir, "checkpoint_iter_370000.pth")
        # Load the weights on CPU, since the prediction nodes have no GPU.
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        load_state(net, checkpoint)
        return cls(net)
Additionally, I have enabled logging for the model in AI Platform and get the following output:
2019-12-17T09:28:06.208537Z OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
2019-12-17T09:28:13.474653Z WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/google/cloud/ml/prediction/frameworks/tf_prediction_lib.py:48: The name tf.saved_model.tag_constants.SERVING is deprecated. Please use tf.saved_model.SERVING instead.
2019-12-17T09:28:13.474680Z {"textPayload":"","insertId":"5df89fad00073e383ced472a","resource":{"type":"cloudml_model_version","labels":{"project_id":"rcg-shopper","region":"","version_id":"lightweight_pose_pytorch","model_id":"pose"}},"timestamp":"2019-12-17T09:28:13.474680Z","logName":"projects/rcg-shopper/logs/ml.googleapis…
2019-12-17T09:28:13.474807Z WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/google/cloud/ml/prediction/frameworks/tf_prediction_lib.py:50: The name tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY is deprecated. Please use tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY instead.
2019-12-17T09:28:13.474829Z {"textPayload":"","insertId":"5df89fad00073ecd4836d6aa","resource":{"type":"cloudml_model_version","labels":{"project_id":"rcg-shopper","region":"","version_id":"lightweight_pose_pytorch","model_id":"pose"}},"timestamp":"2019-12-17T09:28:13.474829Z","logName":"projects/rcg-shopper/logs/ml.googleapis…
2019-12-17T09:28:13.474918Z WARNING:tensorflow:
2019-12-17T09:28:13.474927Z The TensorFlow contrib module will not be included in TensorFlow 2.0.
2019-12-17T09:28:13.474934Z For more information, please see:
2019-12-17T09:28:13.474941Z * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
2019-12-17T09:28:13.474951Z * https://github.com/tensorflow/addons
2019-12-17T09:28:13.474958Z * https://github.com/tensorflow/io (for I/O related ops)
2019-12-17T09:28:13.474964Z If you depend on functionality not listed there, please file an issue.
2019-12-17T09:28:13.474999Z {"textPayload":"","insertId":"5df89fad00073f778735d7c3","resource":{"type":"cloudml_model_version","labels":{"version_id":"lightweight_pose_pytorch","model_id":"pose","project_id":"rcg-shopper","region":""}},"timestamp":"2019-12-17T09:28:13.474999Z","logName":"projects/rcg-shopper/logs/ml.googleapis…
2019-12-17T09:28:15.283483Z ERROR:root:Failed to import GA GRPC module. This is OK if the runtime version is 1.x
2019-12-17T09:28:16.890923Z Copying gs://cml-489210249453-1560169483791188/models/pose/lightweight_pose_pytorch/15316451609316207868/user_code/my_custom_code-0.1.tar.gz...
2019-12-17T09:28:16.891150Z / [0 files][ 0.0 B/ 8.4 KiB]
2019-12-17T09:28:17.007684Z / [1 files][ 8.4 KiB/ 8.4 KiB]
2019-12-17T09:28:17.009154Z Operation completed over 1 objects/8.4 KiB.
2019-12-17T09:28:18.953923Z Processing /tmp/custom_code/my_custom_code-0.1.tar.gz
2019-12-17T09:28:19.808897Z Collecting opencv-python
2019-12-17T09:28:19.868579Z Downloading https://files.pythonhosted.org/packages/d8/38/60de02a4c9013b14478a3f681a62e003c7489d207160a4d7df8705a682e7/opencv_python-4.1.2.30-cp37-cp37m-manylinux1_x86_64.whl (28.3MB)
2019-12-17T09:28:21.537989Z Collecting torch
2019-12-17T09:28:21.552871Z Downloading https://files.pythonhosted.org/packages/f9/34/2107f342d4493b7107a600ee16005b2870b5a0a5a165bdf5c5e7168a16a6/torch-1.3.1-cp37-cp37m-manylinux1_x86_64.whl (734.6MB)
2019-12-17T09:28:52.401619Z Collecting numpy>=1.14.5
2019-12-17T09:28:52.412714Z Downloading https://files.pythonhosted.org/packages/9b/af/4fc72f9d38e43b092e91e5b8cb9956d25b2e3ff8c75aed95df5569e4734e/numpy-1.17.4-cp37-cp37m-manylinux1_x86_64.whl (20.0MB)
2019-12-17T09:28:53.550662Z Building wheels for collected packages: my-custom-code
2019-12-17T09:28:53.550689Z Building wheel for my-custom-code (setup.py): started
2019-12-17T09:28:54.212558Z Building wheel for my-custom-code (setup.py): finished with status 'done'
2019-12-17T09:28:54.215365Z Created wheel for my-custom-code: filename=my_custom_code-0.1-cp37-none-any.whl size=7791 sha256=fd9ecd472a6a24335fd24abe930a4e7d909e04bdc4cf770989143d92e7023f77
2019-12-17T09:28:54.215482Z Stored in directory: /tmp/pip-ephem-wheel-cache-i7sb0bmb/wheels/0d/6e/ba/bbee16521304fc5b017fa014665b9cae28da7943275a3e4b89
2019-12-17T09:28:54.222017Z Successfully built my-custom-code
2019-12-17T09:28:54.650218Z Installing collected packages: numpy, opencv-python, torch, my-custom-code
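One way to read the log above is to total the wheel sizes that pip reports while installing the dependencies. The following is a minimal sketch, not part of the original post: the three `Downloading` lines are copied from the log with the URLs shortened to the wheel file names, and the helper name `downloaded_wheels` is made up for illustration.

```python
import re

# The three "Downloading" lines from the deployment log, shortened to the
# wheel file name and the size that pip reported.
PIP_LOG = """\
Downloading opencv_python-4.1.2.30-cp37-cp37m-manylinux1_x86_64.whl (28.3MB)
Downloading torch-1.3.1-cp37-cp37m-manylinux1_x86_64.whl (734.6MB)
Downloading numpy-1.17.4-cp37-cp37m-manylinux1_x86_64.whl (20.0MB)
"""

# Matches "Downloading <wheel name> (<size>MB)".
DOWNLOAD_RE = re.compile(r"Downloading\s+(\S+)\s+\(([\d.]+)\s*MB\)")


def downloaded_wheels(log_text):
    """Return a list of (wheel_name, size_in_mb) tuples found in pip output."""
    return [(name, float(size)) for name, size in DOWNLOAD_RE.findall(log_text)]


wheels = downloaded_wheels(PIP_LOG)
total_mb = sum(size for _, size in wheels)
largest_name, largest_mb = max(wheels, key=lambda item: item[1])

print("total downloaded: %.1f MB" % total_mb)
print("largest wheel: %s (%.1f MB)" % (largest_name, largest_mb))
```

On these three lines the torch wheel accounts for almost all of the downloaded data, far more than the 83.89 MB model folder itself.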
Answer 1:
This is a common problem, and we understand that it is a pain point. Please do the following:
1. torchvision has torch as a dependency and, by default, it pulls torch from PyPI.
When deploying the model, this happens even if you point AI Platform at custom torchvision packages, because the torchvision wheels built by the PyTorch team are configured to use torch as a dependency. The torch dependency from PyPI is a file of about 720 MB because it includes GPU support.
2. To solve #1, you need to build torchvision from source and tell torchvision where to get torch from: point it at the PyTorch website, where the CPU-only package is smaller. Rebuild the torchvision binary using the direct references feature from Python PEP 440. In the torchvision setup.py we have:
pytorch_dep = 'torch'
if os.getenv('PYTORCH_VERSION'):
    pytorch_dep += "==" + os.getenv('PYTORCH_VERSION')
Update setup.py in torchvision to use the direct references feature:
requirements = [
    #'numpy',
    #'six',
    #pytorch_dep,
    'torch @ https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl'
]
I already did this for you and built 3 wheel files you can use:
gs://dpe-sandbox/torchvision-0.4.0-cp37-cp37m-linux_x86_64.whl (torch 1.2.0, vision 0.4.0)
gs://dpe-sandbox/torchvision-0.4.2-cp37-cp37m-linux_x86_64.whl (torch 1.2.0, vision 0.4.2)
gs://dpe-sandbox/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl (torch 1.4.0, vision 0.5.0)
These torchvision packages will get torch from the PyTorch site instead of PyPI (example: https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl).
3. Update your model's setup.py when deploying the model to AI Platform so that it includes neither torch nor torchvision.
4. Redeploy the model as follows:
PYTORCH_VISION_PACKAGE=gs://dpe-sandbox/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl
gcloud beta ai-platform versions create {MODEL_VERSION} --model={MODEL_NAME} \
--origin=gs://{BUCKET}/{GCS_MODEL_DIR} \
--python-version=3.7 \
--runtime-version={RUNTIME_VERSION} \
--machine-type=mls1-c4-m4 \
--package-uris=gs://{BUCKET}/{GCS_PACKAGE_URI},${PYTORCH_VISION_PACKAGE} \
--prediction-class={MODEL_CLASS}
You can change PYTORCH_VISION_PACKAGE to any of the options I mentioned in #2.
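The direct reference used in the torchvision requirements can also be assembled programmatically, which makes the `%2B` escaping of the `+cpu` suffix explicit. This is only a sketch: the function name `torch_direct_reference` is hypothetical, the base URL is taken from the example link in this answer, and the file name pattern is assumed to match the cp37-style wheels shown here (other Python versions may use a different ABI tag).

```python
from urllib.parse import quote

# Assumed base URL, taken from the example wheel link in this answer.
CPU_WHEEL_BASE = "https://download.pytorch.org/whl/cpu/"


def torch_direct_reference(version, python_tag="cp37", platform="linux_x86_64"):
    """Build a 'name @ url' requirement string for a CPU-only torch wheel.

    Assumes the wheel is named torch-<version>+cpu-<py>-<py>m-<platform>.whl,
    which matches the cp37 example above but is not guaranteed for all builds.
    """
    filename = "torch-{v}+cpu-{py}-{py}m-{plat}.whl".format(
        v=version, py=python_tag, plat=platform
    )
    # quote() percent-encodes the '+' in the local version label as %2B.
    return "torch @ " + CPU_WHEEL_BASE + quote(filename)


requirement = torch_direct_reference("1.4.0")
print(requirement)
```

The resulting string can be placed in the `requirements` list of the torchvision setup.py in place of the plain `torch` entry.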
Answer 2:
I got this working by tweaking setup.py. Basically, install_requires tries to fetch the PyPI-hosted torch package, which is a huge GPU-built wheel that exceeds the deployment quota. The following setup.py injects install commands that fetch the CPU-built torch from the official PyTorch index.
from setuptools import setup, find_packages
from setuptools.command.install import install as _install

INSTALL_REQUIRES = ['pillow']

CUSTOM_INSTALL_COMMANDS = [
    # Install torch here.
    [
        'python-default', '-m', 'pip', 'install', '--target=/tmp/custom_lib',
        '-b', '/tmp/pip_builds', 'torch==1.4.0+cpu', 'torchvision==0.5.0+cpu',
        '-f', 'https://download.pytorch.org/whl/torch_stable.html'
    ],
]


class Install(_install):
    def run(self):
        import sys
        if sys.platform == 'linux':
            import subprocess
            import logging
            for command in CUSTOM_INSTALL_COMMANDS:
                logging.info('Custom command: ' + ' '.join(command))
                result = subprocess.run(
                    command, check=True, stdout=subprocess.PIPE
                )
                logging.info(result.stdout.decode('utf-8', 'ignore'))
        _install.run(self)


setup(
    name='predictor',
    version='0.1',
    packages=find_packages(),
    install_requires=INSTALL_REQUIRES,
    cmdclass={'install': Install},
)
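The core of the custom install step is the run-and-log pattern inside `Install.run`. The sketch below isolates that pattern so it can be tried locally; it is an illustration only. The helper name `run_and_log` is made up, and a harmless Python one-liner stands in for the real pip command.

```python
import logging
import subprocess
import sys


def run_and_log(command):
    """Run one command, log its stdout, and return the decoded output.

    check=True makes subprocess.run raise CalledProcessError when the command
    exits with a non-zero status, so a failed install aborts the setup step.
    """
    logging.info('Custom command: %s', ' '.join(command))
    result = subprocess.run(command, check=True, stdout=subprocess.PIPE)
    output = result.stdout.decode('utf-8', 'ignore')
    logging.info(output)
    return output


# Stand-in for the pip command: a child Python process that prints one line.
demo_output = run_and_log([sys.executable, '-c', 'print("installed")'])
print(demo_output.strip())
```

With the default logging configuration the `logging.info` calls are not displayed; calling `logging.basicConfig(level=logging.INFO)` first makes them visible.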
Answer 3:
After hours of good old trial and error, I came to the same conclusion as @kyamagu: "install_requires try to fetch PyPI hosted torch package which is a huge GPU built wheel and that is exceeding the deployment quota."
However, his solution did not work for me. So after many more hours of trial and error (thanks to missing and incorrect documentation), I came up with this solution:
We need to get the CPU-built PyTorch wheels, which are around 100 MB, rather than the 700 MB GPU builds that PyPI hosts by default. You can find them here: https://download.pytorch.org/whl/cpu/torch_stable.html
Next, we need to place them in our Cloud Storage bucket and then give the path as part of --package-uris, like this:
gcloud beta ai-platform versions create v17 \
--model=newest \
--origin=gs://bucket \
--runtime-version=1.15 \
--python-version=3.7 \
--package-uris=gs://bucket/predictor-0.1.tar.gz,gs://bucket/torch-1.3.0+cpu-cp37-cp37m-linux_x86_64.whl \
--prediction-class=predictor.MyPredictor \
--machine-type=mls1-c4-m4
Also, watch out for the order of the package-uris: the predictor package should come first, and there should not be any spaces after the commas.
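The two rules above (predictor package first, no spaces after commas) are easy to get wrong by hand, so the flag value can be assembled in code. This is a minimal sketch: the helper name `build_package_uris` is hypothetical, and the `gs://` prefix check is an assumption based on the URIs used in this thread.

```python
def build_package_uris(predictor_uri, dependency_uris):
    """Join package URIs for --package-uris with the predictor package first.

    Raises ValueError for URIs that do not start with gs:// or that contain
    whitespace or commas, since either would corrupt the comma-separated list.
    """
    uris = [predictor_uri] + list(dependency_uris)
    for uri in uris:
        if not uri.startswith("gs://"):
            raise ValueError("expected a gs:// URI, got: %r" % uri)
        if "," in uri or any(ch.isspace() for ch in uri):
            raise ValueError("URI must not contain commas or spaces: %r" % uri)
    # No space after the comma, as the answer points out.
    return ",".join(uris)


flag = "--package-uris=" + build_package_uris(
    "gs://bucket/predictor-0.1.tar.gz",
    ["gs://bucket/torch-1.3.0+cpu-cp37-cp37m-linux_x86_64.whl"],
)
print(flag)
```

The printed value matches the --package-uris argument in the command above.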
Hope this helps. cheers!
Source: https://stackoverflow.com/questions/59372655/cannot-deploy-trained-model-to-google-cloud-ai-platform-with-custom-prediction-r