I installed TensorFlow 1.0.1 (GPU version) on my MacBook Pro with a GeForce GT 750M, along with CUDA 8.0.71 and cuDNN 5.1. I am running a tf code that works fine with non-C…
I encountered this problem when I accidentally installed libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb (built for CUDA 9.2) instead of libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb on a system with CUDA 9.0 installed.
I got there because I had CUDA 9.2 installed and had downgraded to CUDA 9.0; evidently libcudnn is specific to the CUDA version.
I have managed to get it working by deleting the .nv folder in my home folder:
sudo rm -rf ~/.nv/
In my case, after checking the cuDNN and CUDA versions, I found my GPU was out of memory. Running watch -n 0.1 nvidia-smi
in another bash terminal, I could see that the moment the error 2019-07-16 19:54:05.122224: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
appeared was exactly when GPU memory was nearly full.
So I configured a limit on how much GPU memory TensorFlow can use. Since I use the tf.keras
module, I added the following code to the beginning of my program:
import tensorflow as tf

# Cap how much GPU memory TensorFlow may allocate for this process
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
tf.keras.backend.set_session(tf.Session(config=config))
Then, problem solved!
You can also reduce your batch_size
or use smarter ways to feed your training data (such as tf.data.Dataset
with caching; a sketch follows below). I hope my answer can help someone else.
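For the tf.data.Dataset approach mentioned above, a minimal sketch might look like this (the x_train / y_train arrays are hypothetical placeholders, not from the original answer; the idea is just caching plus a modest batch size):

import numpy as np
import tensorflow as tf

# Hypothetical in-memory training data; replace with your own arrays.
x_train = np.random.rand(1000, 28, 28).astype(np.float32)
y_train = np.random.randint(0, 10, size=(1000,)).astype(np.int32)

dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .cache()                    # reuse preprocessed data after the first pass
           .shuffle(buffer_size=1000)
           .batch(32)                  # a smaller batch size also lowers peak GPU memory
           .prefetch(1))               # overlap input preparation with training

# In recent TF 1.x versions a tf.keras model accepts the dataset directly,
# e.g. model.fit(dataset, epochs=5, steps_per_epoch=1000 // 32)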
It has to do with the memory fraction available for loading GPU resources to create the cuDNN handle, controlled by per_process_gpu_memory_fraction.
Reducing this memory fraction yourself will solve the error.
> sess_config = tf.ConfigProto(
>     gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.7),
>     allow_soft_placement=True)
>
> with tf.Session(config=sess_config) as sess:
>     sess.run([whatever])
Use as small a fraction as fits in your memory. (In the code above I use 0.7; you can start with 0.3 or even smaller, then increase until you get the same error again; that's your limit.)
Pass it as config to your tf.Session()
or tf.train.MonitoredTrainingSession()
or Supervisor's sv.managed_session().
This should allow your GPU to create a cuDNN handle for your TensorFlow code.
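For example, here is a minimal sketch of passing the same config to tf.train.MonitoredTrainingSession (the train_op below is a hypothetical placeholder for your own training graph):

import tensorflow as tf

sess_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.7),
    allow_soft_placement=True)

# Hypothetical stand-in for a real training op.
global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = tf.assign_add(global_step, 1)

# MonitoredTrainingSession initializes variables and accepts the same config.
with tf.train.MonitoredTrainingSession(config=sess_config) as sess:
    sess.run(train_op)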
In my case the problem seems to have been caused by a TensorFlow and cuDNN version mismatch. The following helped me (I was working on Ubuntu 16.04 with an NVIDIA Tesla K80 on Google Cloud; TensorFlow 1.5 finally worked with cuDNN 7.0.4 and CUDA 9.0):
Remove cuDNN completely:
sudo rm /usr/local/cuda/include/cudnn.h
sudo rm /usr/local/cuda/lib64/libcudnn*
After doing so, import tensorflow should raise an error.
Download the appropriate cuDNN version. Note that there is cuDNN 7.0.4 for CUDA 9.0 and cuDNN 7.0.4 for CUDA 8.0. You should choose the one corresponding to your CUDA version. Be careful at this step or you'll get a similar problem again. Install cuDNN as usual:
tar -xzvf cudnn-9.0-linux-x64-v7.tgz
cd cuda
sudo cp -P include/cudnn.h /usr/include
sudo cp -P lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*
In this example I've installed cuDNN 7.0.x for CUDA 9.0 (x actually doesn't matter). Take care to match your CUDA version.
Restart the computer. In my case the problem vanished. If the error still occurs, consider installing another version of TensorFlow.
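To confirm that TensorFlow can actually create a cuDNN handle after reinstalling, a quick sanity check that forces one convolution onto the GPU can help (this is just an illustrative sketch, not part of the original answer):

import tensorflow as tf

# A tiny convolution forces TensorFlow to initialize cuDNN on the GPU.
with tf.device('/gpu:0'):
    x = tf.random_normal([1, 32, 32, 3])
    w = tf.random_normal([3, 3, 3, 8])
    y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')

# allow_soft_placement=False makes the run fail loudly if the GPU/cuDNN path is broken.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=False)) as sess:
    print(sess.run(y).shape)  # expected: (1, 32, 32, 8)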
Hope this helps someone.
Please remember to close your TensorBoard terminal/cmd or any other terminals that interact with the training directory. Then you can restart the training and it should work.