The accepted answer is wrong (installing nvidia-cuda-toolkit). By installing that toolkit you are basically installing a second CUDA on top of the CUDA already installed from the NVIDIA guide.
The problem turned out to be an issue with symbolic links. The inspiration came from this topic: http://queirozf.com/entries/installing-cuda-tk-and-tensorflow-on-a-clean-ubuntu-16-04-install, but the actual resolution is different.
At one point during the cuDNN installation, the NVIDIA tutorial asks you to do this:
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
The problem with this approach is that copying the files with the libcudnn* filter breaks the symbolic links of the copied files. You can try the following command instead, but in my case it still broke the links:
sudo cp --preserve=links cuda/lib64/libcudnn* /usr/local/cuda/lib64
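As an aside, GNU cp also has a -P (--no-dereference) flag that copies symlinks as symlinks instead of following them, so a variant like the sketch below may avoid the breakage in the first place (I have not verified this on my setup; the manual fix further down works either way):
# -P copies the libcudnn symlinks as symlinks rather than dereferencing them
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*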
You can verify the links by running ls -lha libcudnn* in the /usr/local/cuda/lib64 folder. If you do not see output like this:
lrwxrwxrwx 1 root root 13 May 2 20:02 libcudnn.so -> libcudnn.so.7
lrwxrwxrwx 1 root root 17 May 2 20:02 libcudnn.so.7 -> libcudnn.so.7.6.5
-rwxr-xr-x 1 root root 409M May 2 20:02 libcudnn.so.7.6.5
-rw-r--r-- 1 root root 386M May 2 20:02 libcudnn_static.a
Then you have just found the problem. The actual solution involves the following:
sudo rm /usr/local/cuda/lib64/libcudnn.so
sudo rm /usr/local/cuda/lib64/libcudnn.so.7
cd /usr/local/cuda/lib64/
sudo ln -s libcudnn.so.7.6.5 libcudnn.so.7
sudo ln -s libcudnn.so.7 libcudnn.so
Remove the old "links" and create new ones. Verify the links again with ls -lha libcudnn*
. After that run following command in verbose mode:
sudo ldconfig -v
CHECK the logs. I don't know exactly what it does, but it turned out to be something very important. Also, if the log says that a symbolic link is broken, or something along those lines, then TensorFlow will continue to show the error mentioned in the subject.
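For what it's worth, ldconfig rebuilds the dynamic linker's cache of shared libraries, which is why a broken libcudnn link there matters. If you don't want to scroll through the whole verbose log, a minimal check like this (paths assume the default install location) will surface the problem:
# the warning text is the same one quoted at the end of this answer
sudo ldconfig -v 2>&1 | grep "is not a symbolic link"
# should resolve the whole chain and end in the real library, e.g. libcudnn.so.7.6.5
readlink -f /usr/local/cuda/lib64/libcudnn.so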
BONUS! Make sure you have the following lines appended at the end of your ~/.bashrc (e.g. via nano ~/.bashrc):
export PATH=/usr/local/cuda/bin:/opt/nvidia/nsight-compute/2019.4.0${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDADIR=/usr/local/cuda${CUDADIR:+:${CUDADIR}}
export CUDA_HOME=/usr/local/cuda
and then run the command source ~/.bashrc
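As a quick sanity check that the new values were actually picked up (this just echoes the variables; nothing here is required by the guide):
# both should list the /usr/local/cuda paths added above
echo "$PATH" | tr ':' '\n' | grep cuda
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep cuda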
All of the above steps assume that you did NOT use nvidia-cuda-toolkit, but instead used the NVIDIA CUDA repo.
Also, when installing CUDA, make sure you are not targeting 10.2. At the moment of writing, TF supports versions up to CUDA 10.1, so the following is the right way to install the necessary version:
sudo apt-cache policy cuda
sudo apt-get install cuda=10.1.243-1
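Optionally, to keep apt from later upgrading that package to 10.2 behind your back, you can pin it (this assumes the package is simply named cuda, as in the install command above):
# prevent regular upgrades from pulling in CUDA 10.2
sudo apt-mark hold cuda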
Verify with:
nvcc --version
nvidia-smi
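As a final end-to-end check you can ask TensorFlow itself whether it sees the GPU; a minimal sketch (the exact call depends on your TF version, but tf.test.is_gpu_available() exists on both 1.x and 2.x, where 2.x may print a deprecation warning):
# should print True once libcudnn.so.7 resolves correctly
python3 -c "import tensorflow as tf; print(tf.test.is_gpu_available())"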
EDIT: Here is the error that you should AVOID seeing after running the ldconfig command:
/usr/local/cuda-10.1/targets/x86_64-linux/lib:
...
libnppist.so.10 -> libnppist.so.10.2.0.243
libcuinj64.so.10.1 -> libcuinj64.so.10.1.243
> /sbin/ldconfig.real: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7 is not a symbolic link
libcudnn.so.7 -> libcudnn.so.7.6.5
libnppc.so.10 -> libnppc.so.10.2.0.243
libnppicom.so.10 -> libnppicom.so.10.2.0.243
libnvgraph.so.10 -> libnvgraph.so.10.1.243
/usr/lib/x86_64-linux-gnu/libfakeroot:
...
If you see it, then something is still misconfigured.