How to install CUDA 8.0 in the latest version of Tensorflow (1.0) in AWS p2.xlarge instance, AMI ami-edb11e8d and nvidia drivers up to date (375.39)

问题

I have upgraded to Tensorflow version 1.0 and installed CUDA 8.0 with the cudnn 5.1 version and the nvidia drivers up to date 375.39. My NVIDIA hardware is the one that is on Amazon Web Services using the p2.xlarge instance, a Tesla K-80. My OS is Linux 64-bit.

I get the next error message every time I use the command: tf.Session()

[ec2-user@ip-172-31-7-96 CUDA]$ python
Python 2.7.12 (default, Sep  1 2016, 22:14:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
>>> sess = tf.Session()
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: ip-172-31-7-96
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: ip-172-31-7-96
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Invalid argument: expected %d.%d or %d.%d.%d form for driver version; got "1"
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  375.39  Tue Jan 31 20:47:00 PST 2017
GCC version:  gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 375.39.0

I'm completely clueless about how to fix this. I have tried different versions of Nvidia drivers and CUDA but still it does not work.

Any hints will be appreciated.

回答1:

You need to install a NVIDIA Driver and run the CUDA 8.0 installer.

# Requirements
# - NVIDIA Driver - NVIDIA-Linux-x86_64-375.39.run - http://www.nvidia.fr/Download/index.aspx
# - CUDA runfile (local) - cuda_8.0.61_375.26_linux.run - https://developer.nvidia.com/cuda-downloads
# - cudnn-8.0-linux-x64-v5.0-ga.tgz

sudo apt update -y && sudo apt upgrade -y
sudo apt install build-essential linux-image-extra-`uname -r` -y

chmod +x NVIDIA-Linux-x86_64-375.39.run
sudo ./NVIDIA-Linux-x86_64-375.39.run

chmod +x cuda_8.0.61_375.26_linux.run
./cuda_8.0.61_375.26_linux.run --extract=`pwd`/extracts
sudo ./extracts/cuda-linux64-rel-8.0.61-21551265.run

echo -e "export CUDA_HOME=/usr/local/cuda\nexport PATH=\$PATH:\$CUDA_HOME/bin\nexport LD_LIBRARY_PATH=\$LD_LINKER_PATH:\$CUDA_HOME/lib64" >> ~/.bashrc
source .bashrc

tar xf cudnn-8.0-linux-x64-v5.0-ga.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/cudnn.h /usr/local/cuda/include/

回答2:

Uninstall drivers & cuda, then follow the official guide to reinstall.

Run deviceQuery to check that the device is installed properly.

回答3:

You can also try "NVIDIA Volta Deep Learning AMI" with p3 (v100 GPU) instance.

Sign up on https://www.nvidia.com/en-us/gpu-cloud/?ncid=van-gpu-cloud and get your "API Key" to use the AMI free of charge.

EC2/GPU config info: https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/

回答4:

The AWS Deep Learning AMI has CUDA 8, 9 and 10 pre-installed and so you should not have to do this installation now.

Reference: https://docs.aws.amazon.com/dlami/latest/devguide/overview-cuda.html

来源：https://stackoverflow.com/questions/42422065/how-to-install-cuda-8-0-in-the-latest-version-of-tensorflow-1-0-in-aws-p2-xlar

标签

Linux

amazon-web-services

cuda

tensorflow

nvidia