Question
I am using Azure Machine Learning Service to deploy a ML model as web service.
I registered a model and now would like to deploy it as an ACI web service as in the guide.
To do so I define
from azureml.core.webservice import Webservice, AciWebservice
from azureml.core.image import ContainerImage
aciconfig = AciWebservice.deploy_configuration(cpu_cores=4,
                                               memory_gb=32,
                                               tags={"data": "text", "method": "NB"},
                                               description='Predict something')
and
image_config = ContainerImage.image_configuration(execution_script="score.py",
                                                  docker_file="Dockerfile",
                                                  runtime="python",
                                                  conda_file="myenv.yml")
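For reference, a minimal sketch of what such a conda_file might contain (the package list is an illustrative assumption; the real one depends on the model's dependencies):

```yaml
# Hypothetical conda environment file; adjust packages and versions to the model.
name: scoring-env
dependencies:
  - scikit-learn
  - pip:
      - azureml-defaults
```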
and create an image with
image = ContainerImage.create(name="scorer-image",
                              models=[model],
                              image_config=image_config,
                              workspace=ws)
Image creation succeeds with
Creating image Image creation operation finished for image scorer-image:5, operation "Succeeded"
Also, troubleshooting the image by running it locally on an Azure VM with
sudo docker run -p 8002:5001 myscorer0588419434.azurecr.io/scorer-image:5
allows me to run queries locally against http://localhost:8002/score successfully.
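Such a query against the locally running container can be sketched as follows (the payload shape is a hypothetical stand-in; the real schema is whatever run() in score.py expects):

```python
import json
from urllib import request

# Hypothetical payload; the actual schema depends on run() in score.py.
payload = json.dumps({"data": ["some text to classify"]}).encode("utf-8")

def score_local(body, url="http://localhost:8002/score"):
    """POST a JSON body to the locally running scoring container."""
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```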
However, deployment with
service_name = 'scorer-svc'
service = Webservice.deploy_from_image(deployment_config=aciconfig,
                                       image=image,
                                       name=service_name,
                                       workspace=ws)
fails with
Creating service
Running.
FailedACI service creation operation finished, operation "Failed"
Service creation polling reached terminal state, current service state: Transitioning
Service creation polling reached terminal state, unexpected response received. Transitioning
I tried setting a more generous memory_gb in the aciconfig, but to no avail: the deployment stays in a Transitioning state when monitored on the Azure portal.
Also, running service.get_logs()
gives me
WebserviceException: Received bad response from Model Management Service: Response Code: 404
What could possibly be the culprit?
Answer 1:
If ACI deployment fails, one solution is trying to allocate less resources, e.g.
aciconfig = AciWebservice.deploy_configuration(cpu_cores=1,
                                               memory_gb=8,
                                               tags={"data": "text", "method": "NB"},
                                               description='Predict something')
While the error messages thrown are not particularly informative, this case is actually clearly stated in the documentation:
When a region is under heavy load, you may experience a failure when deploying instances. To mitigate such a deployment failure, try deploying instances with lower resource settings [...]
The documentation also states which are the maximum values of the CPU/RAM resources available in the different regions (at the time of writing, requiring a deployment with memory_gb=32
would likely fail in all regions because of insufficient resources).
Upon requiring less resources, deployment should succeed with
Creating service
Running......................................................
SucceededACI service creation operation finished, operation
"Succeeded" Healthy
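Since capacity varies by region and over time, one pragmatic pattern is to retry the deployment with progressively smaller resource requests. A minimal sketch, where deploy_fn is a hypothetical stand-in for the actual Webservice.deploy_from_image call:

```python
def deploy_with_fallback(deploy_fn, ladder):
    """Try each (cpu_cores, memory_gb) pair in order until one succeeds.

    deploy_fn stands in for the real deployment call; it should raise on
    failure and return the service object on success.
    """
    last_err = None
    for cpu_cores, memory_gb in ladder:
        try:
            return deploy_fn(cpu_cores, memory_gb)
        except Exception as err:  # real code would catch WebserviceException
            last_err = err
    raise RuntimeError("all resource configurations failed") from last_err

# Start modest and shrink if the region is under load.
ladder = [(2, 8), (1, 4), (1, 2)]
```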
Answer 2:
I have the same problem, but the above solution does not work for me. In addition, I get errors like the one below:
{
  "code": "AciDeploymentFailed",
  "message": "Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.\nPlease check the logs for your container instance: anomaly-detection-2. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. \nYou can also try to run image mlad046a4688.azurecr.io/anomaly-detection-2@sha256:fcbba67cf683626291c1bd084f31438fcd641ddaf80f9bdf8cea274d22d1fcb5 locally. Please refer to http://aka.ms/debugimage#service-launch-fails for more information.",
  "details": [
    {
      "code": "CrashLoopBackOff",
      "message": "Your container application crashed. This may be caused by errors in your scoring file's init() function.\nPlease check the logs for your container instance: anomaly-detection-2. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. \nYou can also try to run image mlad046a4688.azurecr.io/anomaly-detection-2@sha256:fcbba67cf683626291c1bd084f31438fcd641ddaf80f9bdf8cea274d22d1fcb5 locally. Please refer to http://aka.ms/debugimage#service-launch-fails for more information."
    }
  ]
}
It keeps pointing to the scoring file, but I am not sure what is wrong here:
import json
import numpy as np
import os
import pickle
import joblib
#from sklearn.externals import joblib  # deprecated; use plain joblib instead
from sklearn.linear_model import LogisticRegression
from azureml.core.authentication import AzureCliAuthentication
from azureml.core import Model, Workspace
import logging

logging.basicConfig(level=logging.DEBUG)

def init():
    global model
    # retrieve the path to the model file using the model name
    model_path = Model.get_model_path(model_name='admlpkl')
    print(model_path)
    model = joblib.load(model_path)
    #ws = Workspace.from_config(auth=cli_auth)
    #modeld = ws.models['admlpkl']
    #model = Model.deserialize(ws, modeld)

def run(raw_data):
    data = json.loads(raw_data)
    # make prediction; tolist() makes the numpy array JSON-serializable
    y_hat = model.predict(data)
    r = json.dumps(y_hat.tolist())
    return r
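One way to catch a CrashLoopBackOff before building the image is to exercise the init()/run() contract locally, with a dummy model standing in for the joblib-loaded one (DummyModel and the sample payload below are illustrative assumptions, not part of the real deployment):

```python
import json

class DummyModel:
    """Stand-in for the joblib-loaded model, so init()/run() can be
    exercised without Azure or the real model file."""
    def predict(self, data):
        return [0 for _ in data]

model = None

def init():
    global model
    model = DummyModel()

def run(raw_data):
    data = json.loads(raw_data)
    y_hat = model.predict(data)
    return json.dumps(list(y_hat))

init()
print(run(json.dumps([[1, 2], [3, 4]])))  # prints [0, 0]
```

Running the script this way surfaces errors such as missing imports at call time rather than inside the container, where they only show up as a crash loop.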
The model has a dependency on another file, which I have added in
image_config = ContainerImage.image_configuration(execution_script="score.py",
                                                  runtime="python",
                                                  conda_file='conda_dependencies.yml',
                                                  dependencies=['modeling.py'])
The logs are too abstract and do not really help with debugging. I am able to create the image, but provisioning the service fails.
Any input would be appreciated.
Source: https://stackoverflow.com/questions/55347910/why-does-my-ml-model-deployment-in-azure-container-instance-still-fail