How does one train multiple models in a single script in TensorFlow when there are GPUs present?

暗喜 2021-01-30 14:36

Say I have access to a number of GPUs in a single machine (for the sake of argument, assume 8 GPUs, each with a maximum of 8 GB of memory, in one single machine with some amount of RAM a

4 Answers
  • 2021-01-30 14:38

    An easy solution: Give each model a unique session and graph.

    It works for this platform: TensorFlow 1.12.0, Keras 2.1.6-tf, Python 3.6.7, Jupyter Notebook.

    Key code:

    with session.as_default():
        with session.graph.as_default():
            # do something about an ANN model
    

    Full code:

    import tensorflow as tf
    from tensorflow import keras
    import gc
    
    def limit_memory():
        """ Release unused memory resources. Force garbage collection """
        keras.backend.clear_session()
        keras.backend.get_session().close()
        tf.reset_default_graph()
        gc.collect()
        #cfg = tf.ConfigProto()
        #cfg.gpu_options.allow_growth = True
        #keras.backend.set_session(tf.Session(config=cfg))
        keras.backend.set_session(tf.Session())
        gc.collect()
    
    
    def create_and_train_ANN_model(hyper_parameter):
        print('create and train my ANN model')
        info = { 'result about this ANN model' }
        return info
    
    for i in range(10):
        limit_memory()        
        session = tf.Session()
        keras.backend.set_session(session)
        with session.as_default():
            with session.graph.as_default():   
                hyper_parameter = { 'A set of hyper-parameters' }  
                info = create_and_train_ANN_model(hyper_parameter)      
        limit_memory()
    

    Inspired by this link: Keras (Tensorflow backend) Error - Tensor input_1:0, specified in either feed_devices or fetch_devices was not found in the Graph

  • 2021-01-30 14:40

    You probably don't want to do this.

    If you run thousands and thousands of models on your data, and pick the one that evaluates best, you are not doing machine learning; instead you are memorizing your data set, and there is no guarantee that the model you pick will perform at all outside that data set.

    In other words, that approach is similar to having a single model with thousands of degrees of freedom. A model with such a high order of complexity is problematic, since it will be able to fit your data better than is actually warranted; such a model is annoyingly able to memorize any noise (outliers, measurement errors, and such) in your training data, which causes the model to perform poorly when the noise is even slightly different.
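
    The effect can be seen in a small simulation. This is only an illustrative sketch, not part of the original answer: the lookup-table "models", the data sizes and the seed are all made up. The labels are pure coin flips, so no model can truly do better than 50%, yet the best of 1000 random models looks clearly better than chance on the data used to pick it.

```python
import random


def make_data(rng, n, n_features=16):
    # Labels are coin flips, independent of x: there is nothing to learn.
    return [(rng.randrange(n_features), rng.randrange(2)) for _ in range(n)]


def accuracy(model, data):
    """Fraction of (x, y) pairs for which the lookup-table model predicts y."""
    return sum(model[x] == y for x, y in data) / len(data)


def select_best_of_many(n_models=1000, n_select=200, n_fresh=2000,
                        n_features=16, seed=0):
    rng = random.Random(seed)
    selection_set = make_data(rng, n_select, n_features)
    fresh_set = make_data(rng, n_fresh, n_features)
    # Each "model" is a random lookup table from feature value to label.
    models = [[rng.randrange(2) for _ in range(n_features)]
              for _ in range(n_models)]
    # Pick the model that scores best on the selection data.
    best = max(models, key=lambda m: accuracy(m, selection_set))
    return accuracy(best, selection_set), accuracy(best, fresh_set)


if __name__ == "__main__":
    selected_acc, fresh_acc = select_best_of_many()
    print("accuracy on the data used for selection: %.3f" % selected_acc)
    print("accuracy on fresh data:                  %.3f" % fresh_acc)
```

    Scoring the chosen model on data that played no part in the selection, as the second number does here, is the usual way to detect this.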

    (Apologies for posting this as an answer, the site wouldn't let me add a comment.)

  • 2021-01-30 14:49

    I think that running all models in one single script can be bad practice in the long term (see my suggestion below for a better alternative). However, if you would like to do it, here is a solution: encapsulate your TF session in a process with the multiprocessing module. This makes sure TF releases the session memory once the process is done. Here is a code snippet:

    from multiprocessing import Pool
    import contextlib
    
    def my_model(params):  # Pool passes a single argument, so the parameters arrive packed in one tuple
        param1, param2, param3 = params  # Python 3 no longer allows tuple unpacking in the signature
        < your code >
    
    num_pool_workers = 1  # can be bigger than 1, to enable parallel execution
    with contextlib.closing(Pool(num_pool_workers)) as po:  # This ensures that the processes get closed once they are done
        pool_results = po.map_async(my_model,
                                    ((param1, param2, param3)
                                     for param1, param2, param3 in params_list))
        results_list = pool_results.get()
    

    Note from OP: The random number generator seed does not reset automatically with the multi-processing library if you choose to use it. Details here: Using python multiprocessing with different random seed for each process
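
    One way to deal with the seed issue is to pass an explicit seed along with each task. The sketch below is an assumption-laden illustration rather than the OP's code: train_one, make_tasks, run_all and the base seed are hypothetical names and values, and the "training" is a stand-in that just draws a random number.

```python
import random
from multiprocessing import Pool


def train_one(task):
    """Hypothetical worker: `task` is a (seed, learning_rate) tuple."""
    seed, learning_rate = task
    # A private generator seeded from the task, so the result does not
    # depend on whatever RNG state the worker process inherited.
    rng = random.Random(seed)
    fake_score = rng.random()  # stand-in for a real training run
    return seed, learning_rate, fake_score


def make_tasks(learning_rates, base_seed=1234):
    # Give every task its own seed by offsetting a base seed with its index.
    return [(base_seed + i, lr) for i, lr in enumerate(learning_rates)]


def run_all(learning_rates, num_workers=2):
    # Pool.map blocks until every task has finished.
    with Pool(num_workers) as pool:
        return pool.map(train_one, make_tasks(learning_rates))
```

    Call run_all([0.1, 0.01, 0.001]) from under an `if __name__ == "__main__":` guard. In a real worker you would presumably seed NumPy and TensorFlow from the same per-task seed as well.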

    About TF resource allocation: TF usually allocates far more GPU memory than it needs. Often you can restrict each process to a fraction of the total GPU memory, and discover through trial and error the fraction your script requires.

    You can do it with the following snippet

    import tensorflow as tf
    
    gpu_memory_fraction = 0.3 # Choose this number through trial and error
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction)
    session_config = tf.ConfigProto(gpu_options=gpu_options)
    # `graph` is the tf.Graph that holds your model
    sess = tf.Session(config=session_config, graph=graph)
    

    Note that sometimes TF increases the memory usage in order to accelerate the execution. Therefore, reducing the memory usage might make your model run slower.

    Answers to the new questions in your edit/comments:

    1. Yes, TensorFlow's GPU memory will be allocated afresh every time a new process is created, and released once the process ends.

    2. The for-loop in your edit should also do the job. I suggest using Pool instead, because it lets you run several models concurrently on a single GPU. See my notes about setting gpu_memory_fraction and "choosing the maximal number of processes". Also note that: (1) Pool's map runs the loop for you, so you don't need an outer for-loop once you use it. (2) In your example, you should have something like mdl=get_model(args) before calling train().

    3. Weird tuple parentheses: Pool's map passes only a single argument to the function, so we use a tuple to pass multiple arguments. See multiprocessing.pool.map and function with two arguments for more details. As suggested in one answer, you can make it more readable with

      def train_mdl(params):
          (x,y)=params
          < your code >
      
    4. As @Seven suggested, you can use the CUDA_VISIBLE_DEVICES environment variable to choose which GPU your process uses. You can set it from within your Python script by putting the following at the beginning of the process function (train_mdl).

      import os # the import can be on the top of the python script
      os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(gpu_id)
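
    Put together with the tuple-argument pattern from point 3, the process function might look like the sketch below. The (gpu_id, learning_rate) layout of params and the helper pin_process_to_gpu are assumptions made for illustration. The variable has to be set before TensorFlow initializes CUDA in that process, which is why the import is deferred.

```python
import os


def pin_process_to_gpu(gpu_id):
    """Expose only one GPU to the current process; return the value set."""
    value = str(gpu_id)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


def train_mdl(params):
    """Hypothetical process function: `params` is (gpu_id, learning_rate)."""
    gpu_id, learning_rate = params
    pin_process_to_gpu(gpu_id)
    # Import TensorFlow only after the variable is set, e.g.:
    #     import tensorflow as tf
    # and build/train the model here. Inside this process the chosen GPU
    # is then addressed as device 0.
    return {"gpu_id": gpu_id, "learning_rate": learning_rate}
```

    With 8 GPUs, a simple way to spread jobs is to give job number i the id i % 8.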
      

    A better practice for executing your experiments would be to isolate your training/evaluation code from the hyper parameters/ model search code. E.g. have a script named train.py, which accepts a specific combination of hyper parameters and references to your data as arguments, and executes training for a single model.
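
    A minimal skeleton of such a train.py could look like this. It is a sketch under assumptions: the flag names (--lrn_rate, --batch_size, --data_dir) and their defaults are placeholders, and train() only echoes its settings where the real model code would go.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train a single model.")
    # Hypothetical flags; replace them with your own hyper-parameters.
    parser.add_argument("--lrn_rate", type=float, default=0.01)
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--data_dir", type=str, default="data")
    return parser.parse_args(argv)


def train(args):
    # Placeholder for building and training one model with these settings.
    return {"lrn_rate": args.lrn_rate,
            "batch_size": args.batch_size,
            "data_dir": args.data_dir}


if __name__ == "__main__":
    print(train(parse_args()))
```

    It would be invoked as, for example, python train.py --lrn_rate=0.001 --batch_size=16.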

    Then, to iterate through all the possible combinations of parameters, you can use a simple task (job) queue and submit each combination of hyper-parameters as a separate job. The task queue will feed your jobs one at a time to your machine. Usually, you can also set the queue to execute a number of processes concurrently (see details below).

    Specifically, I use task spooler, which is super easy to install and handy (it doesn't require admin privileges, details below).

    Basic usage is (see notes below about task spooler usage):

    ts <your-command>
    

    In practice, I have a separate Python script that manages my experiments, sets all the arguments for each experiment and sends the jobs to the ts queue.

    Here are some relevant snippets of python code from my experiments manager:

    run_bash executes a bash command

    import subprocess
    
    def run_bash(cmd):
        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, executable='/bin/bash')
        out = p.stdout.read().strip()
        return out  # This is the stdout from the shell command (bytes in Python 3)
    

    The next snippet sets the number of concurrent processes to be run (see note below about choosing the maximal number of processes):

    max_job_num_per_gpu = 2
    run_bash('ts -S %d'%max_job_num_per_gpu)
    

    The next snippet iterates through a list of all combinations of hyper params / model params. Each element of the list is a dictionary, where the keys are the command line arguments for the train.py script

    for combination_dict in combinations_list:
    
        # .items() works on Python 2 and 3 (the original used the Python 2-only .iteritems())
        job_cmd = 'python train.py ' + ' '.join(
                ['--{}={}'.format(flag, value) for flag, value in combination_dict.items()])
    
        submit_cmd = "ts bash -c '%s'" % job_cmd
        run_bash(submit_cmd)
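
    The snippet above takes combinations_list as given. One possible way to build it from a grid of candidate values is sketched here; build_combinations and the grid contents are hypothetical and not part of my experiments manager.

```python
import itertools


def build_combinations(grid):
    """Expand {flag: [values, ...]} into a list of {flag: value} dicts."""
    flags = sorted(grid)
    value_lists = [grid[flag] for flag in flags]
    return [dict(zip(flags, values))
            for values in itertools.product(*value_lists)]


# Hypothetical search grid; the flag names are placeholders.
grid = {"lrn_rate": [0.1, 0.01], "batch_size": [8, 16, 32]}
combinations_list = build_combinations(grid)
```

    Each dictionary in the result then becomes one train.py command line, so this grid yields 2 x 3 = 6 jobs.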
    

    A note about choosing the maximal number of processes:

    If you are short on GPUs, you can use the gpu_memory_fraction you found to set the number of processes: max_job_num_per_gpu=int(1/gpu_memory_fraction)
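
    As a worked example of that formula, a fraction of 0.3 gives int(1/0.3) = int(3.33...) = 3 processes per GPU, which together claim 90% of the memory. The helper below is just an illustrative wrapper with a range check added; the function name is made up.

```python
def max_jobs_per_gpu(gpu_memory_fraction):
    """Number of processes that fit on one GPU at the given memory fraction."""
    if not 0.0 < gpu_memory_fraction <= 1.0:
        raise ValueError("gpu_memory_fraction must be in (0, 1]")
    # int() rounds down, so the processes never claim more than 100%.
    return int(1 / gpu_memory_fraction)
```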

    Notes about task spooler (ts):

    1. You could set the number of concurrent processes to run ("slots") with:

      ts -S <number-of-slots>

    2. Installing ts doesn't require admin privileges. You can download and compile it from source with a simple make, add it to your path and you're done.

    3. You can set up multiple queues (I use it for multiple GPUs), with

      TS_SOCKET=<path_to_queue_name> ts <your-command>

      e.g.

      TS_SOCKET=/tmp/socket-ts.gpu_queue_1 ts <your-command>

      TS_SOCKET=/tmp/socket-ts.gpu_queue_2 ts <your-command>

    4. See here for further usage example
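
    Combining note 3 with CUDA_VISIBLE_DEVICES, one queue per GPU could be driven from Python roughly as follows. This is a sketch with assumptions: the socket naming scheme (one socket per GPU id, starting from 0), the round-robin assignment and both function names are illustrative choices. The functions only build the command strings; you would still pass each one to something like run_bash.

```python
import shlex


def build_submit_cmd(job_cmd, gpu_id,
                     socket_prefix="/tmp/socket-ts.gpu_queue_"):
    """Build a shell command that submits job_cmd to the ts queue of one GPU."""
    # Pin the job to its GPU inside the submitted command itself.
    pinned_cmd = "CUDA_VISIBLE_DEVICES={} {}".format(gpu_id, job_cmd)
    return "TS_SOCKET={}{} ts bash -c {}".format(
        socket_prefix, gpu_id, shlex.quote(pinned_cmd))


def distribute(job_cmds, num_gpus):
    """Assign jobs to GPU queues round-robin and return the submit commands."""
    return [build_submit_cmd(cmd, i % num_gpus)
            for i, cmd in enumerate(job_cmds)]
```

    shlex.quote is used so that the job command survives as a single argument to bash -c even if it contains spaces or quotes.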

    A note about automatically setting the path names and file names: Once you separate your main code from the experiment manager, you will need an efficient way to generate file names and directory names from the hyper-params. I usually keep my important hyper-params in a dictionary and generate a single chained string from its key-value pairs. Here are the functions I use for doing it:

    import re
    
    def build_string_from_dict(d, sep='%'):
        """
        Builds a string from a dictionary.
        Mainly used for formatting hyper-params to file names.
        Key-value pairs are sorted by the key name.
    
        Args:
            d: input dictionary
            sep: separator placed between the key=value pairs
    
        Returns: string
        """
    
        return sep.join(['{}={}'.format(k, _value2str(d[k])) for k in sorted(d.keys())])
    
    
    def _value2str(val):
        if isinstance(val, float):
            # %g means: "Floating point format.
            # Uses lowercase exponential format if exponent is less than -4 or not less than precision,
            # decimal format otherwise."
            val = '%g' % val
        else:
            val = '{}'.format(val)
        # Replace dots with underscores so the value is safe inside a file name
        val = re.sub(r'\.', '_', val)
        return val
    
  • 2021-01-30 14:51

    As I understand it, TensorFlow first constructs a symbolic graph and derives the gradients using the chain rule. It then allocates memory for all (necessary) tensors, including some layer inputs and outputs that are kept for efficiency. When a session runs, data is loaded into the graph, but in general memory use does not change any further.

    The error you met, I guess, may be caused by constructing several models in one GPU.

    Isolating your training/evaluation code from the hyper-parameter search is a good choice, as @user2476373 proposed. I use a bash script directly rather than task spooler (which may be more convenient), e.g.

    CUDA_VISIBLE_DEVICES=0 python train.py --lrn_rate 0.01 --weight_decay_rate 0.001 --momentum 0.9 --batch_size 8 --max_iter 60000 --snapshot 5000
    CUDA_VISIBLE_DEVICES=0 python eval.py 
    

    Alternatively, you can write the 'for' loop in the bash script rather than in a Python script. Note that I put CUDA_VISIBLE_DEVICES=0 at the beginning of each command (the index could go up to 7 if you have 8 GPUs in one machine). In my experience, TensorFlow uses all the GPUs in the machine unless I specify which GPU the operations should use, with code like this

    with tf.device('/gpu:0'):
        # ops created inside this block are placed on GPU 0
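
    For completeness, the same loop can be driven from Python instead of bash, with each run pinned to one GPU through CUDA_VISIBLE_DEVICES. This is only a sketch: gpu_env and run_sequentially are made-up helper names, and the train.py arguments in the usage line are taken from the bash example above.

```python
import os
import subprocess
import sys


def gpu_env(gpu_id):
    """Copy of the current environment with only one GPU visible."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def run_sequentially(script_args_list, gpu_id=0):
    """Run `python <args>` for each argument list in turn on the given GPU.

    Returns the exit code of every run.
    """
    env = gpu_env(gpu_id)
    return [subprocess.call([sys.executable] + list(args), env=env)
            for args in script_args_list]
```

    For example, run_sequentially([["train.py", "--lrn_rate", "0.01"], ["eval.py"]], gpu_id=0) would mirror the two bash lines shown earlier.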
    

    If you want to try a multi-GPU implementation, there are some examples.

    Hope this could help you.
