Tensorflow: Setting allow_growth to true does still allocate memory of all my GPUs

问题

I have several GPUs but I only want to use one GPU for my training. I am using following options:

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:

Despite setting / using all these options, all of my GPUs allocate memory and

#processes = #GPUs

How can I prevent this from happening?

Note

I do not want use set the devices manually and I do not want to set CUDA_VISIBLE_DEVICES since I want tensorflow to automatically find the best (an idle) GPU available
When I try to start another run it uses the same GPU that is already used by another tensorflow process even though there are several other free GPUs (apart from the memory allocation on them)
I am running tensorflow in a docker container: tensorflow/tensorflow:latest-devel-gpu-py

回答1:

I can offer you a method mask_busy_gpus defined here: https://github.com/yselivonchyk/TensorFlow_DCIGN/blob/master/utils.py

Simplified version of the function:

import subprocess as sp
import os

def mask_unused_gpus(leave_unmasked=1):
  ACCEPTABLE_AVAILABLE_MEMORY = 1024
  COMMAND = "nvidia-smi --query-gpu=memory.free --format=csv"

  try:
    _output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
    memory_free_info = _output_to_list(sp.check_output(COMMAND.split()))[1:]
    memory_free_values = [int(x.split()[0]) for i, x in enumerate(memory_free_info)]
    available_gpus = [i for i, x in enumerate(memory_free_values) if x > ACCEPTABLE_AVAILABLE_MEMORY]

    if len(available_gpus) < leave_unmasked: ValueError('Found only %d usable GPUs in the system' % len(available_gpus))
    os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(map(str, available_gpus[:leave_unmasked]))
  except Exception as e:
    print('"nvidia-smi" is probably not installed. GPUs are not masked', e)

Usage:

mask_unused_gpus()
with tf.Session()...

Prerequesities: nvidia-smi

With this script I was solving next problem: on a multy-GPU cluster use only single (or arbitrary) number of GPUs allowing them to be automatically allocated.

Shortcoming of the script: if you are starting multiple scripts at once random assignment might cause same GPU assignment, because script depends on memory allocation and memory allocation takes some seconds to kick in.

回答2:

I had this problem my self. Setting config.gpu_options.allow_growth = True Did not do the trick, and all of the GPU memory was still consumed by Tensorflow. The way around it is the undocumented environment variable TF_FORCE_GPU_ALLOW_GROWTH (I found it in https://github.com/tensorflow/tensorflow/blob/3e21fe5faedab3a8258d344c8ad1cec2612a8aa8/tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc#L25)

Setting TF_FORCE_GPU_ALLOW_GROWTH=true works perfectly.

In the Python code, you can set

os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

来源：https://stackoverflow.com/questions/47910681/tensorflow-setting-allow-growth-to-true-does-still-allocate-memory-of-all-my-gp

标签

tensorflow

deep-learning

gpu

nvidia