How does one train multiple models in a single script in TensorFlow when there are GPUs present?

暗喜 2021-01-30 14:36

Say I have access to a number of GPUs in a single machine (for the sake of argument, assume 8 GPUs, each with a maximum of 8 GB of memory, in one machine with some amount of RAM a

4 answers
  •  爱一瞬间的悲伤
    2021-01-30 14:51

    As I understand it, TensorFlow first constructs a symbolic graph and derives the gradients via the chain rule. It then allocates memory for all the tensors it needs, including some layer inputs and outputs that are kept for efficiency. When a session runs, data is fed into the graph, but in general memory use does not change after that.

    My guess is that the error you hit is caused by constructing several models on one GPU.

    Isolating your training/evaluation code from the hyperparameters is a good choice, as @user2476373 proposed. I use a bash script directly rather than task spooler (which may be more convenient), e.g.

    CUDA_VISIBLE_DEVICES=0 python train.py --lrn_rate 0.01 --weight_decay_rate 0.001 --momentum 0.9 --batch_size 8 --max_iter 60000 --snapshot 5000
    CUDA_VISIBLE_DEVICES=0 python eval.py 
    

    Or you can write a 'for' loop in the bash script rather than in the Python script. Note that I put CUDA_VISIBLE_DEVICES=0 at the start of each command (with 8 GPUs in one machine, the index can be anything from 0 to 7). I do this because, in my experience, TensorFlow uses all the GPUs in the machine unless I specify which GPU the operations should run on, with code like this:

    with tf.device('/gpu:0'):
        pass  # build this model's ops here
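
    The one-process-per-run idea above can also be driven from a small Python launcher instead of a bash loop. This is only a minimal sketch under stated assumptions: `build_jobs` and `run_jobs` are hypothetical helper names, `train.py` is assumed to accept the `--lrn_rate` and `--weight_decay_rate` flags shown earlier, and the round-robin GPU assignment is an illustrative choice, not something the question prescribes.

```python
import os
import subprocess
import sys
from itertools import product


def build_jobs(lrn_rates, weight_decay_rates, num_gpus):
    """Return one (env, command) pair per hyperparameter combination.

    GPUs are assigned round-robin: combination i is pinned to GPU i % num_gpus.
    """
    jobs = []
    for i, (lr, wd) in enumerate(product(lrn_rates, weight_decay_rates)):
        env = dict(os.environ)
        # The child process sees exactly one GPU, which it addresses as /gpu:0.
        env["CUDA_VISIBLE_DEVICES"] = str(i % num_gpus)
        # train.py and its flags are assumed to match the commands shown earlier.
        cmd = [sys.executable, "train.py",
               "--lrn_rate", str(lr),
               "--weight_decay_rate", str(wd)]
        jobs.append((env, cmd))
    return jobs


def run_jobs(jobs, num_gpus):
    """Run jobs in waves of num_gpus so that no GPU hosts two models at once."""
    for start in range(0, len(jobs), num_gpus):
        wave = [subprocess.Popen(cmd, env=env)
                for env, cmd in jobs[start:start + num_gpus]]
        for proc in wave:
            proc.wait()
```

    Calling `run_jobs(build_jobs([0.01, 0.001], [0.001, 0.0001], num_gpus=8), num_gpus=8)` would start four training processes, each restricted to its own GPU. Because every model lives in a separate process, one run cannot exhaust the memory of a GPU that another run is using.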
    

    If you want to try a multi-GPU implementation, there are some examples available.

    Hope this could help you.
