Problem deploying the best estimator gotten with sagemaker.estimator.Estimator (w/ sklearn custom image)

Question

After creating an SKLearn() instance and running a HyperparameterTuner over a few hyperparameter ranges, I get the best estimator. When I try to deploy() that estimator, the log shows an error. Exactly the same error occurs when I create a transformer and call transform() on it. Neither deployment nor batch transform works. What could be the problem, and how could I at least narrow it down?

I have no idea how to even begin to figure this out. Googling didn't help. Nothing comes up.

Creating SKLearn instance:

from sagemaker.sklearn.estimator import SKLearn

sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=session,
    hyperparameters={'model': 'rfc'})

Putting tuner to work:

from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(estimator = sklearn,
                            objective_metric_name = objective_metric_name,
                            objective_type = 'Minimize',
                            metric_definitions = metric_definitions,
                            hyperparameter_ranges = hyperparameters,
                            max_jobs = 3, # 9,
                            max_parallel_jobs = 4)

tuner.fit({'train': s3_input_train})
tuner.wait()
best_training_job = tuner.best_training_job()
the_best_estimator = sagemaker.estimator.Estimator.attach(best_training_job)

This gives a valid best training job. Everything seems great.

Here is where the problem manifests:

predictor = the_best_estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

or the following (triggers exactly the same problem):

rfc_transformer = the_best_estimator.transformer(1, instance_type="ml.m4.xlarge")
rfc_transformer.transform(test_location)
rfc_transformer.wait()

Here is the log with the error message (it reiterates the same error many times while trying to deploy or transform; here is the beginning of the log):

................[2019-09-22 09:17:48 +0000] [17] [INFO] Starting gunicorn 19.9.0
[2019-09-22 09:17:48 +0000] [17] [INFO] Listening at: unix:/tmp/gunicorn.sock (17)
[2019-09-22 09:17:48 +0000] [17] [INFO] Using worker: gevent
[2019-09-22 09:17:48 +0000] [24] [INFO] Booting worker with pid: 24
[2019-09-22 09:17:48 +0000] [25] [INFO] Booting worker with pid: 25
[2019-09-22 09:17:48 +0000] [26] [INFO] Booting worker with pid: 26
[2019-09-22 09:17:48 +0000] [30] [INFO] Booting worker with pid: 30
2019-09-22 09:18:15,061 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
2019-09-22 09:18:15,062 INFO - sagemaker_sklearn_container.serving - Encountered an unexpected error.
[2019-09-22 09:18:15 +0000] [24] [ERROR] Error handling request /ping
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base_async.py", line 56, in handle
    self.handle_request(listener_name, req, client, addr)
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
    addr)
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base_async.py", line 107, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.5/dist-packages/sagemaker_sklearn_container/serving.py", line 119, in main
    user_module_transformer = import_module(serving_env.module_name, serving_env.module_dir)
  File "/usr/local/lib/python3.5/dist-packages/sagemaker_sklearn_container/serving.py", line 97, in import_module
    user_module = importlib.import_module(module_name)
  File "/usr/lib/python3.5/importlib/__init__.py", line 117, in import_module
    if name.startswith('.'):
AttributeError: 'NoneType' object has no attribute 'startswith'
169.254.255.130 - - [22/Sep/2019:09:18:15 +0000] "GET /ping HTTP/1.1" 500 141 "-" "Go-http-client/1.1"
2019-09-22 09:18:15,178 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
2019-09-22 09:18:15,179 INFO - sagemaker_sklearn_container.serving - Encountered an unexpected error.
[2019-09-22 09:18:15 +0000] [30] [ERROR] Error handling request /ping
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base_async.py", line 56, in handle
    self.handle_request(listener_name, req, client, addr)
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
    addr)
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base_async.py", line 107, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.5/dist-packages/sagemaker_sklearn_container/serving.py", line 119, in main
    user_module_transformer = import_module(serving_env.module_name, serving_env.module_dir)
  File "/usr/local/lib/python3.5/dist-packages/sagemaker_sklearn_container/serving.py", line 97, in import_module
    user_module = importlib.import_module(module_name)
  File "/usr/lib/python3.5/importlib/__init__.py", line 117, in import_module
    if name.startswith('.'):
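
One way to start narrowing this down is to read the last frames of the traceback: importlib.import_module is being handed None as the module name, which suggests the serving container was never told which user script to load. The sketch below reproduces the same exception locally, outside SageMaker. The helper names load_user_module and describe_failure are hypothetical and are not the container's actual code.

```python
import importlib


def load_user_module(module_name):
    """Hypothetical stand-in for the container's import step.

    It only forwards the name to importlib, which is what the traceback
    shows serving.py doing with serving_env.module_name.
    """
    return importlib.import_module(module_name)


def describe_failure(module_name):
    """Return the AttributeError message for a module name, or None."""
    try:
        load_user_module(module_name)
    except AttributeError as exc:
        return str(exc)
    return None


# A missing module name (None) triggers the same kind of error as in the log.
print(describe_failure(None))

# A real module name imports without that error.
print(describe_failure("json"))
```

If the message printed for None matches the one in the log, the thing to investigate is why the module name is unset in the serving container rather than anything in the model itself.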

Answer 1:

Double-check that the necessary environment variables are set on the model. I ran into this issue when I hadn't set SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT, SAGEMAKER_PROGRAM, and SAGEMAKER_SUBMIT_DIRECTORY. Compare against a working base model to see which environment variables need to be set.
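
A minimal sketch of that check, assuming the model's environment is available as a plain dictionary. The function name missing_serving_vars and both sample dictionaries are hypothetical, and every value in them is a placeholder rather than something taken from the question.

```python
# The three variables named in the answer above.
REQUIRED_SERVING_VARS = (
    "SAGEMAKER_PROGRAM",
    "SAGEMAKER_SUBMIT_DIRECTORY",
    "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT",
)


def missing_serving_vars(environment):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_SERVING_VARS if not environment.get(name)]


# Hypothetical environment of a model created without the script settings.
broken_env = {"SAGEMAKER_REGION": "us-east-1"}

# Hypothetical environment of a working model; all values are placeholders.
working_env = {
    "SAGEMAKER_PROGRAM": "train.py",
    "SAGEMAKER_SUBMIT_DIRECTORY": "s3://example-bucket/source/sourcedir.tar.gz",
    "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv",
    "SAGEMAKER_REGION": "us-east-1",
}

print(missing_serving_vars(broken_env))
print(missing_serving_vars(working_env))
```

The dictionary to check could come from the Environment field of the model's primary container, as returned by the boto3 SageMaker client's describe_model call. If variables turn out to be missing, the env argument accepted by the SDK's Model class is one place to supply them. It may also be worth trying SKLearn.attach or tuner.best_estimator() instead of the generic Estimator.attach, on the assumption (not confirmed in this thread) that the framework-specific class carries these settings over to the model.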

Source: https://stackoverflow.com/questions/58050712/problem-deploying-the-best-estimator-gotten-with-sagemaker-estimator-estimator

Tags

scikit-learn

amazon-sagemaker