问题
I have several GPUs but I only want to use one GPU for my training. I am using following options:
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
Despite setting / using all these options, all of my GPUs allocate memory and
#processes = #GPUs
How can I prevent this from happening?
Note
- I do not want use set the devices manually and I do not want to set
CUDA_VISIBLE_DEVICES
since I want tensorflow to automatically find the best (an idle) GPU available - When I try to start another
run
it uses the same GPU that is already used by another tensorflow process even though there are several other free GPUs (apart from the memory allocation on them) - I am running tensorflow in a docker container:
tensorflow/tensorflow:latest-devel-gpu-py
回答1:
I can offer you a method mask_busy_gpus
defined here: https://github.com/yselivonchyk/TensorFlow_DCIGN/blob/master/utils.py
Simplified version of the function:
import subprocess as sp
import os
def mask_unused_gpus(leave_unmasked=1):
ACCEPTABLE_AVAILABLE_MEMORY = 1024
COMMAND = "nvidia-smi --query-gpu=memory.free --format=csv"
try:
_output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
memory_free_info = _output_to_list(sp.check_output(COMMAND.split()))[1:]
memory_free_values = [int(x.split()[0]) for i, x in enumerate(memory_free_info)]
available_gpus = [i for i, x in enumerate(memory_free_values) if x > ACCEPTABLE_AVAILABLE_MEMORY]
if len(available_gpus) < leave_unmasked: ValueError('Found only %d usable GPUs in the system' % len(available_gpus))
os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(map(str, available_gpus[:leave_unmasked]))
except Exception as e:
print('"nvidia-smi" is probably not installed. GPUs are not masked', e)
Usage:
mask_unused_gpus()
with tf.Session()...
Prerequesities: nvidia-smi
With this script I was solving next problem: on a multy-GPU cluster use only single (or arbitrary) number of GPUs allowing them to be automatically allocated.
Shortcoming of the script: if you are starting multiple scripts at once random assignment might cause same GPU assignment, because script depends on memory allocation and memory allocation takes some seconds to kick in.
回答2:
I had this problem my self. Setting config.gpu_options.allow_growth = True
Did not do the trick, and all of the GPU memory was still consumed by Tensorflow.
The way around it is the undocumented environment variable TF_FORCE_GPU_ALLOW_GROWTH
(I found it in
https://github.com/tensorflow/tensorflow/blob/3e21fe5faedab3a8258d344c8ad1cec2612a8aa8/tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc#L25)
Setting TF_FORCE_GPU_ALLOW_GROWTH=true
works perfectly.
In the Python code, you can set
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
来源:https://stackoverflow.com/questions/47910681/tensorflow-setting-allow-growth-to-true-does-still-allocate-memory-of-all-my-gp