TensorFlow in nvidia-docker: failed call to cuInit: CUDA_ERROR_UNKNOWN


I have been working on getting an application that relies on TensorFlow to work as a docker container with nvidia-docker. I have compiled my application on top

3 Answers

耶瑟儿～

2020-12-19 11:31

I run TensorFlow on my Ubuntu 16.04 desktop.

A few days ago my code ran fine on the GPU, but today the code below could no longer find the GPU device:

import tensorflow as tf
from tensorflow.python.client import device_lib as _device_lib

with tf.Session() as sess:
    local_device_protos = _device_lib.list_local_devices()
    print(local_device_protos)
    for x in local_device_protos:
        print(x.name)

Then I noticed the following error when I ran tf.Session():

cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN

I checked my NVIDIA driver in the system details, and used nvcc -V and nvidia-smi to check the driver, CUDA, and cuDNN. Everything seemed fine.

Then I opened Additional Drivers to check the driver details. It listed many versions of the NVIDIA driver, with the latest one selected, although there was only one when I first installed the driver.

So I selected an older version and applied the change.

When I ran tf.Session() again, the error was still there. I figured I should reboot my computer, and after the reboot the issue was gone.

sess = tf.Session()
2018-07-01 12:02:41.336648: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-01 12:02:41.464166: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-01 12:02:41.464482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.8225
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.27GiB
2018-07-01 12:02:41.464494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-01 12:02:42.308689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-01 12:02:42.308721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-01 12:02:42.308729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-01 12:02:42.309686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7022 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability:
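
One possible reason the reboot helped, which this answer does not confirm: when the driver package changes while the old kernel module is still loaded, the user-space CUDA libraries and the loaded module can disagree until the module is reloaded. The sketch below reads the version of the currently loaded module from /proc/driver/nvidia/version, a file the NVIDIA Linux driver exposes. The helper names are hypothetical, and the parsing assumes the usual "NVRM version: ..." line layout.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical helper names; only the /proc path comes from the NVIDIA Linux driver.
PROC_VERSION = Path("/proc/driver/nvidia/version")


def parse_nvrm_version(text: str) -> Optional[str]:
    """Return the first dotted version number on the 'NVRM version:' line, if any."""
    match = re.search(r"NVRM version:.*?(\d+\.\d+(?:\.\d+)?)", text)
    return match.group(1) if match else None


def loaded_driver_version(path: Path = PROC_VERSION) -> Optional[str]:
    """Version of the loaded kernel module, or None if the file cannot be read."""
    try:
        return parse_nvrm_version(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("loaded NVIDIA kernel module:", loaded_driver_version())
```

If the printed version differs from the driver version selected in Additional Drivers, the module has probably not been reloaded since the change, and a reboot is the simplest way to bring the two back in line.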

执笔经年

2020-12-19 11:36

I tried installing nvidia-modprobe, but still got the same error. Then a simple system reboot fixed it for me.
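
For context, nvidia-modprobe exists to load the NVIDIA kernel module and create the /dev/nvidia* device files. A minimal sketch for checking whether those files are present is shown here. The list of node names is an assumption for a single-GPU machine, and the helper name is hypothetical.

```python
import os
from typing import List

# Assumed node names for a single-GPU machine; multi-GPU hosts also have nvidia1, nvidia2, ...
EXPECTED_NODES = ("nvidiactl", "nvidia-uvm", "nvidia0")


def missing_nvidia_nodes(dev_dir: str = "/dev") -> List[str]:
    """Return the expected NVIDIA device nodes that are absent from dev_dir."""
    return [name for name in EXPECTED_NODES
            if not os.path.exists(os.path.join(dev_dir, name))]


if __name__ == "__main__":
    missing = missing_nvidia_nodes()
    if missing:
        print("missing device nodes:", ", ".join(missing))
    else:
        print("all expected NVIDIA device nodes are present")
```

An empty result only shows that the files exist. It does not prove that the driver behind them is in a usable state.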

醉酒成梦

2020-12-19 11:48
The problem may be related to the permissions on the JIT cache files that CUDA creates. On Linux, the cache files are created in ~/.nv/ComputeCache by default. Setting another directory for the JIT cache solves the problem. Just run
```
export CUDA_CACHE_PATH=/tmp/nvidia
```
before running anything on the GPU.
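
The same setting can be applied from inside a Python script, as long as it happens before TensorFlow is imported. This is a sketch only: the default directory /tmp/nvidia is taken from the command above, and the helper name is hypothetical.

```python
import os


def use_cuda_cache_dir(path: str = "/tmp/nvidia") -> str:
    """Point the CUDA JIT cache at a directory the current user can write to."""
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"{path} is not writable by the current user")
    os.environ["CUDA_CACHE_PATH"] = path
    return path


if __name__ == "__main__":
    # Set the variable first, so it is in place when the CUDA driver is initialized.
    print("CUDA_CACHE_PATH =", use_cuda_cache_dir())
    # import tensorflow as tf  # import TensorFlow only after this point
```

The writability check is there because the suspected cause is a permissions problem: a cache directory that exists but belongs to another user would otherwise fail in the same hard-to-read way.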