CUDA Error: out of memory - Python interpreter utilizes all GPU memory

Submitted by 北城以北 on 2020-02-04 04:31:07

Question


Even after rebooting the machine, more than 95% of the GPU memory is used by python3. Note that the memory consumption persists even when no training scripts are running, and I've never used Keras/TensorFlow in the system environment, only inside a venv or a Docker container.

UPDATED: The last activity was the execution of an NN test script with the following configuration:

tensorflow==1.14.0
Keras==2.0.3

import tensorflow as tf
from keras import backend as K

tf.autograph.set_verbosity(1)
# Limit CPU thread pools and fix the random seed
session_conf = tf.ConfigProto(intra_op_parallelism_threads=8, inter_op_parallelism_threads=8)
tf.set_random_seed(1)
# Allocate GPU memory on demand instead of grabbing it all up front
session_conf.gpu_options.allow_growth = True
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)

$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.26       Driver Version: 440.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P3    N/A /  N/A |   3981MiB /  4042MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4105      G   /usr/lib/xorg/Xorg                           145MiB |
|    0      4762      C   /usr/bin/python3                            3631MiB |
|    0      4764      G   /usr/bin/gnome-shell                          88MiB |
|    0      5344      G   ...quest-channel-token=8947774662807822104    61MiB |
|    0      6470      G   ...Charm-P/ch-0/191.6605.12/jre64/bin/java     5MiB |
|    0      7200      C   python                                        45MiB |
+-----------------------------------------------------------------------------+


After rebooting into recovery mode, I tried running nvidia-smi -r, but it doesn't solve the issue.


Answer 1:


By default, TensorFlow allocates GPU memory for the lifetime of the process, not the lifetime of the session object, so memory can linger long after the object is gone. That is why memory remains allocated after you stop the program. Setting gpu_options.allow_growth = True is more flexible in many cases, but it still lets the process allocate as much GPU memory as it asks for at runtime.

To prevent tf.Session from using all of your GPU memory, you can cap the total amount the process may allocate by replacing gpu_options.allow_growth = True with a fixed memory fraction (let's use 50%, since your program seems to use a lot of memory):

session_conf.gpu_options.per_process_gpu_memory_fraction = 0.5

This should stop you from reaching the upper limit and cap usage at roughly 2 GB (since your GPU appears to have 4 GB).
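A minimal sketch of the full session setup with the memory cap, using the same TF 1.x API as in your question (the 0.5 fraction is just an example value; tune it to your workload):

import tensorflow as tf
from keras import backend as K

session_conf = tf.ConfigProto(intra_op_parallelism_threads=8,
                              inter_op_parallelism_threads=8)
# Cap this process at ~50% of the GPU's memory (~2 GB on a 4 GB card)
session_conf.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)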



Source: https://stackoverflow.com/questions/59363874/cuda-error-out-of-memory-python-interpreter-utilizes-all-gpu-memory
