问题
Today, we updated the GPU driver for our host machine, and the new containers that we created are all working fine. However, all of our existing docker containers give the following error when running the nvidia-smi
command inside:
Failed to initialize NVML: Driver/library version mismatch
How to rescue these old containers? Our previous GPU driver version in the host machine was 384.125
and it is now 430.64
.
Host Configuration
nvidia-smi
gives
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64 Driver Version: 430.64 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-DGXS... Off | 00000000:07:00.0 On | 0 |
| N/A 40C P0 39W / 300W | 182MiB / 32505MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-DGXS... Off | 00000000:08:00.0 Off | 0 |
| N/A 40C P0 39W / 300W | 12MiB / 32508MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-DGXS... Off | 00000000:0E:00.0 Off | 0 |
| N/A 39C P0 40W / 300W | 12MiB / 32508MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-DGXS... Off | 00000000:0F:00.0 Off | 0 |
| N/A 40C P0 38W / 300W | 12MiB / 32508MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1583 G /usr/lib/xorg/Xorg 169MiB |
+-----------------------------------------------------------------------------+
nvcc --version
gives
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
dpkg -l | grep -i docker
gives
ii dgx-docker-cleanup 1.0-1 amd64 DGX Docker cleanup script
rc dgx-docker-options 1.0-7 amd64 DGX docker daemon options
ii dgx-docker-repo 1.0-1 amd64 docker repository configuration file
ii docker-ce 5:18.09.2~3-0~ubuntu-xenial amd64 Docker: the open-source application container engine
ii docker-ce-cli 5:18.09.2~3-0~ubuntu-xenial amd64 Docker CLI: the open-source application container engine
ii nvidia-container-runtime 2.0.0+docker18.09.2-1 amd64 NVIDIA container runtime
ii nvidia-docker 1.0.1-1 amd64 NVIDIA Docker container tools
rc nvidia-docker2 2.0.3+docker18.09.2-1 all nvidia-docker CLI wrapper
docker version
gives
Client:
Version: 18.09.2
API version: 1.39
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 04:13:50 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.2
API version: 1.39 (minimum version 1.12)
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 03:42:13 2019
OS/Arch: linux/amd64
Experimental: false
来源:https://stackoverflow.com/questions/63079329/old-docker-containers-are-not-usable-no-gpu-after-updating-the-gpu-driver-in-t