TF 2: Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

Submitted by 落爺英雄遲暮 on 2021-01-29 05:54:29

Question


I am getting the above error (Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR) when I execute the code below. I have checked that my GPU is working using tf.test.is_gpu_available (sketched below).
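A minimal sketch of that GPU check, for reference (the exact call may differ between TF versions):

import tensorflow as tf

# Returns True when TensorFlow can see a usable GPU; deprecated in newer releases,
# where tf.config.list_physical_devices('GPU') is preferred.
print(tf.test.is_gpu_available())

The failing script itself is: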

# coding: utf-8

import tensorflow as tf
import numpy as np
import keras
from models import *
import os 
import gc 

# Note: this only creates a Python variable. To have any effect, the flag has to be
# set as an environment variable before TensorFlow initializes the GPU, e.g.
# os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
TF_FORCE_GPU_ALLOW_GROWTH = True

np.random.seed(1000)
#Paths
MODEL_CONF = "../models/conf/"
MODEL_WEIGHTS = "../models/weights/"
#Model informations
N_CLASSES = 3


def load_array(name):
    return np.load(name, allow_pickle = True)


gc.collect()

dirData = "saved_data/"
trainDir = dirData + "train/"

model = AdaptedLeNet((168, 168, 8), N_CLASSES)
model.summary(print_fn=lambda x: print(x + '\n'))

# Compile the model with the specified loss function.
model.compile(optimizer=keras.optimizers.Adam(),
            loss='categorical_crossentropy',
            metrics=['accuracy'])

for filename in os.listdir(trainDir):
    data = load_array(trainDir + filename)

    train = data["a"]
    labels = data["b"].astype(int).reshape(-1) 
    one_hot_targets = np.eye(N_CLASSES)[labels]

    model.fit(x=train, y=one_hot_targets, batch_size=32, epochs=5)

    gc.collect()

The output of this code is:

Epoch 1/5
2020-04-03 18:50:43.397010: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-03 18:50:43.608330: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-03 18:50:44.274270: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-03 18:50:44.275686: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-03 18:50:44.275747: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node conv2d_1/convolution}}]]
Traceback (most recent call last):
  File "cnnAlert.py", line 62, in <module>
    model.fit(x=train, y=one_hot_targets, batch_size=32, epochs=5)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/keras/engine/training.py", line 1239, in fit
    validation_freq=validation_freq)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
    outs = fit_function(ins_batch)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py", line 3727, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1551, in __call__
    return self._call_impl(args, kwargs)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1591, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[node conv2d_1/convolution (defined at /home/geodatin/env/py3GEE/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_2350]

Function call stack:
keras_scratch_graph

Some more information:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1660    Off  | 00000000:01:00.0  On |                  N/A |
| 27%   41C    P8     9W / 120W |    211MiB /  5911MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       989      G   /usr/lib/xorg/Xorg                            78MiB |
|    0      1438      G   cinnamon                                      31MiB |
|    0      8622      G   ...uest-channel-token=16736224539216711033    99MiB |
+-----------------------------------------------------------------------------+

How do I solve this error? Can you help me?

EDIT 1

  • CUDNN_VERSION from cudnn.h: 7605 (7.6.5)
  • Host compiler version: GCC 7.5.0
  • TensorFlow: 2.1.0-rc0
  • The cuDNN library is on my LD_LIBRARY_PATH
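For completeness, a minimal sketch of how this can be double-checked from Python (only the TensorFlow version and the CUDA build flag are reported by TensorFlow itself; reading LD_LIBRARY_PATH this way is just a convenience):

import os
import tensorflow as tf

print(tf.__version__)                         # e.g. 2.1.0-rc0
print(tf.test.is_built_with_cuda())           # True if this build was compiled against CUDA
print(os.environ.get('LD_LIBRARY_PATH', ''))  # should include the directory holding libcudnn.so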

Answer 1:


You might need to set the TensorFlow session option config.gpu_options.allow_growth to True, which can be done by adding the following to the top of your code (with TF 2.x these classes live under tf.compat.v1, so that is what the snippet below uses):

# Allocate GPU memory on demand instead of reserving it all up front.
gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
keras.backend.tensorflow_backend.set_session(sess)
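If the model is built with tf.keras rather than standalone Keras, the same idea applies; a minimal sketch (again assuming the TF 2.x compat APIs) would be:

import tensorflow as tf

# Create a session whose GPU memory allocation grows on demand and register it with tf.keras.
gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
tf.compat.v1.keras.backend.set_session(sess)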



Answer 2:


There is an answer to a question about TF 1.0 that explains how to do this for TF 2. The suggestion from that answer worked for me, so I'll copy it here. TF 2 is moving away from tf.Session, so I tend to prefer this suggestion over the other answer here.

# Fail early if no GPU is visible, then let GPU memory grow on demand.
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)
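For the script in the question, a minimal sketch of where this would go (right after the imports and before the model is built; looping over all visible GPUs is an assumption, the question only shows one):

import tensorflow as tf

# Enable memory growth for every visible GPU before any model or session is created,
# so TensorFlow allocates GPU memory incrementally instead of reserving it all at once.
physical_devices = tf.config.experimental.list_physical_devices('GPU')
for gpu in physical_devices:
    tf.config.experimental.set_memory_growth(gpu, True)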


Source: https://stackoverflow.com/questions/61021287/tf-2-could-not-create-cudnn-handle-cudnn-status-internal-error
