Why does my ML model deployment in Azure Container Instance still fail?

心不动则不痛 submitted on 2019-12-01 11:00:01


I am using Azure Machine Learning Service to deploy an ML model as a web service.

I registered a model and now would like to deploy it as an ACI web service as in the guide.

To do so I define

from azureml.core.webservice import Webservice, AciWebservice
from azureml.core.image import ContainerImage

aciconfig = AciWebservice.deploy_configuration(cpu_cores=4, 
                      tags={"data": "text",  "method" : "NB"}, 
                      description='Predict something')


image_config = ContainerImage.image_configuration(execution_script="score.py", 

and create an image with

image = ContainerImage.create(name = "scorer-image",
                      models = [model],
                      image_config = image_config,
                      workspace = ws)

Image creation succeeds with

Creating image
Image creation operation finished for image scorer-image:5, operation "Succeeded"

Also, troubleshooting the image by running it locally on an Azure VM with

sudo docker run -p 8002:5001 myscorer0588419434.azurecr.io/scorer-image:5

allows me to run (locally) queries successfully against http://localhost:8002/score.
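For reference, a query against the locally running container could look like the sketch below. It uses only the Python standard library. The {"data": [...]} payload shape is an assumption (the scoring script for this image is not shown), and the helper names build_score_request and query are made up for illustration; adapt them to whatever run() in score.py expects.

```python
import json
import urllib.request

# Endpoint of the container started with `docker run -p 8002:5001 ...`.
SCORE_URL = "http://localhost:8002/score"


def build_score_request(rows, url=SCORE_URL):
    """Build a JSON POST request for the scoring endpoint.

    The {"data": rows} payload shape is an assumption; match it to
    what run() in score.py actually parses.
    """
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query(rows, url=SCORE_URL, timeout=30):
    """Send the request and return the decoded response body."""
    request = build_score_request(rows, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the container running, print(query(["some text to classify"])) should print whatever run() returns. If this works locally but the ACI deployment still fails, the problem is more likely on the provisioning side than in the image.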

However, deployment with

service_name = 'scorer-svc'
service = Webservice.deploy_from_image(deployment_config = aciconfig,
                                        image = image,
                                        name = service_name,
                                        workspace = ws)

fails with

Creating service
FailedACI service creation operation finished, operation "Failed"
Service creation polling reached terminal state, current service state: Transitioning
Service creation polling reached terminal state, unexpected response received. Transitioning

I tried setting a more generous memory_gb in the aciconfig, but to no avail: the deployment stays in a Transitioning state, which is also what the Azure portal shows when monitoring it.

Also, running service.get_logs() gives me

WebserviceException: Received bad response from Model Management Service: Response Code: 404

What could possibly be the culprit?


If ACI deployment fails, one solution is to try allocating fewer resources, e.g.

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                  tags={"data": "text",  "method" : "NB"}, 
                  description='Predict something')

While the error messages thrown are not particularly informative, this is actually clearly stated in the documentation:

When a region is under heavy load, you may experience a failure when deploying instances. To mitigate such a deployment failure, try deploying instances with lower resource settings [...]

The documentation also lists the maximum CPU/RAM resources available in each region (at the time of writing, requesting a deployment with memory_gb=32 would likely fail in all regions because of insufficient resources).

After requesting fewer resources, deployment should succeed with

Creating service
SucceededACI service creation operation finished, operation "Succeeded"
Healthy
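The "try lower resource settings" advice can be automated with a small fallback loop. The sketch below is not part of the Azure ML SDK: deploy_with_fallback and the deploy callable are hypothetical names, and deploy is assumed to perform one deployment attempt and raise an exception when it fails.

```python
def deploy_with_fallback(deploy, resource_options):
    """Try each (cpu_cores, memory_gb) pair in order until one deploys.

    `deploy` is a hypothetical callable taking (cpu_cores, memory_gb);
    it must return the deployed service or raise on failure.
    Returns the service together with the settings that worked.
    """
    errors = []
    for cpu_cores, memory_gb in resource_options:
        try:
            service = deploy(cpu_cores, memory_gb)
            return service, (cpu_cores, memory_gb)
        except Exception as exc:
            # Remember why this attempt failed, then try smaller settings.
            errors.append(((cpu_cores, memory_gb), exc))
    raise RuntimeError(f"All deployment attempts failed: {errors}")
```

In practice, deploy would build an AciWebservice.deploy_configuration(cpu_cores=..., memory_gb=...), call Webservice.deploy_from_image(...), wait with service.wait_for_deployment(show_output=True), and raise if service.state is not "Healthy". A failed service with the same name probably has to be deleted (service.delete()) before the next attempt. The options would be listed from largest to smallest, e.g. [(4, 8), (2, 4), (1, 1)].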


I have the same problem, but the above solution does not work for me. In addition, I get errors like the one below:

"code": "AciDeploymentFailed",
"message": "Aci Deployment failed with exception: Your container application 
crashed. This may be caused by errors in your scoring file's init() 
function.\nPlease check the logs for your container instance: anomaly-detection-2. 
From the AML SDK, you can run print(service.get_logs()) if you have service object 
to fetch the logs. \nYou can also try to run image 
2@sha256:fcbba67cf683626291c1bd084f31438fcd641ddaf80f9bdf8cea274d22d1fcb5 locally. 
Please refer to http://aka.ms/debugimage#service-launch-fails for more 
"details": [
  "code": "CrashLoopBackOff",
  "message": "Your container application crashed. This may be caused by errors in 
your scoring file's init() function.\nPlease check the logs for your container 
instance: anomaly-detection-2. From the AML SDK, you can run 
print(service.get_logs()) if you have service object to fetch the logs. \nYou can 
also try to run image mlad046a4688.azurecr.io/anomaly-detection- 
2@sha256:fcbba67cf683626291c1bd084f31438fcd641ddaf80f9bdf8cea274d22d1fcb5 locally. 
Please refer to http://aka.ms/debugimage#service-launch-fails for more 

It keeps pointing to the scoring file, but I am not sure what is wrong with it:

import numpy as np
import os
import pickle
import joblib
#from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from azureml.core.authentication import AzureCliAuthentication
from azureml.core import Model,Workspace
import logging


def init():
    global model
    from sklearn.externals import joblib
    # retrieve the path to the model file using the model name
    model_path = Model.get_model_path(model_name='admlpkl')
    model = joblib.load(model_path)
    #ws = Workspace.from_config(auth=cli_auth)
    #modeld = ws.models['admlpkl']
    #model=Model.deserialize(ws, modeld)

def run(raw_data):
    # data = np.array(json.loads(raw_data)['data'])
    # make prediction
    data = json.loads(raw_data)
    y_hat = model.predict(data)
    #r = json.dumps(y_hat.tolist())
    r = json.dumps(y_hat)
    return r

The model has a dependency on another file, which I have added in

image_config = ContainerImage.image_configuration(execution_script="score.py", 

The logs are too abstract and do not really help with debugging. I am able to create the image, but provisioning the service fails.

Any input would be appreciated.
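Without the container logs this is only a guess, but a few things in the scoring script above look suspicious:

- json is used in run() but never imported.
- sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23, so the import inside init() fails on newer versions; it also shadows the standalone joblib imported at the top.
- json.dumps cannot serialize a NumPy array, so the commented-out .tolist() call is needed.
- Any top-level import of a package that is missing from the image's conda dependencies fails before init() even runs.

A minimal sketch of a more defensive score.py follows. The loader parameter and the _load_registered_model helper are hypothetical names introduced here so that init() and run() can be smoke-tested locally with a stub model; Azure ML itself calls init() with no arguments. The model name admlpkl is taken from the script above.

```python
import json
import logging
import traceback

model = None


def _load_registered_model():
    # Imports live inside the function so that an ImportError is caught
    # and logged by init() instead of killing the container silently.
    import joblib
    from azureml.core.model import Model

    model_path = Model.get_model_path(model_name="admlpkl")
    return joblib.load(model_path)


def init(loader=_load_registered_model):
    """Load the model once at container start and log any failure."""
    global model
    try:
        model = loader()
    except Exception:
        # The full traceback ends up in the container logs.
        logging.error("init() failed:\n%s", traceback.format_exc())
        raise


def run(raw_data):
    """Score a JSON request body and return a JSON string."""
    try:
        data = json.loads(raw_data)
        y_hat = model.predict(data)
        # NumPy arrays are not JSON serializable; convert them to lists.
        if hasattr(y_hat, "tolist"):
            y_hat = y_hat.tolist()
        return json.dumps(y_hat)
    except Exception as exc:
        # Return the error to the caller instead of crashing the request.
        return json.dumps({"error": str(exc)})
```

Before building the image, init() can be called locally with a stub, for example init(loader=lambda: my_stub_model), followed by run('[[1, 2]]'), which catches import and serialization mistakes in seconds rather than after a failed deployment.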

