Keras uses way too much GPU memory when calling train_on_batch, fit, etc

甜味超标 2021-02-01 02:21

I've been messing with Keras, and like it so far. There's one big issue I have been having when working with fairly deep networks: when calling model.train_on_batch, or model

3 Answers
  • 2021-02-01 02:35

    Both Theano and Tensorflow augment the symbolic graph that is created, though each does it differently.

    To analyze where the memory goes, you can start with a smaller model and grow it, watching the corresponding growth in memory. Similarly, you can grow the batch_size and watch how memory grows with it.

    Here is a code snippet for increasing batch_size based on your initial code:

    import os

    import numpy as np
    import matplotlib.pyplot as plt
    # Written against the Keras 1.x API (Convolution2D, border_mode).
    from keras.models import Sequential
    from keras.layers import Dense, Convolution2D, MaxPooling2D, Flatten
    
    
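    def gpu_memory():
        """Return the GPU memory (in MiB) that nvidia-smi reports for this process."""
        out = os.popen("nvidia-smi").read()
        ret = '0MiB'  # default when this process is not listed yet
        for item in out.split("\n"):
            # Process rows contain the PID and the process name.
            if str(os.getpid()) in item and 'python' in item:
                ret = item.strip().split(' ')[-2]  # e.g. '211MiB'
        return float(ret[:-3])  # strip the 'MiB' suffix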
    
    gpu_mem = []
    gpu_mem.append(gpu_memory())
    
    model = Sequential()
    model.add(Convolution2D(100, 3, 3, border_mode='same', input_shape=(16,16,1)))
    model.add(Convolution2D(256, 3, 3, border_mode='same'))
    model.add(Convolution2D(32, 3, 3, border_mode='same'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Flatten())
    model.add(Dense(4))
    model.add(Dense(1))
    
    model.summary()
    gpu_mem.append(gpu_memory())
    
    model.compile(optimizer='sgd',
                  loss='mse', 
                  metrics=['accuracy'])
    gpu_mem.append(gpu_memory())
    
    
    batches = []
    n_batches = 20
    batch_size = 1
    for ibatch in range(n_batches):
        batch_size = (ibatch+1)*10
        batches.append(batch_size)
        x = np.random.rand(batch_size, 16,16,1)
        y = np.random.rand(batch_size, 1)
        print(y.shape)
    
        model.train_on_batch(x, y)         
        print("Trained one iteration")
    
        gpu_mem.append(gpu_memory())
    
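    # The first three readings (before the model, after building it, after
    # compile) have no batch size, so they are plotted at -100, -50 and 0.
    fig = plt.figure()
    plt.plot([-100, -50, 0]+batches, gpu_mem)
    plt.show()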
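
    The gpu_memory() helper above depends on the layout of nvidia-smi's human-readable table. A sketch of a less layout-dependent variant follows, assuming an nvidia-smi that supports the --query-compute-apps CSV query. The function name parse_compute_apps and the sample text are made up for illustration; they are not part of the original answer.

```python
import os
import subprocess


def parse_compute_apps(csv_text):
    """Map PID -> used GPU memory in MiB from nvidia-smi CSV output.

    Expects lines such as "12345, 211", as produced with
    --format=csv,noheader,nounits.
    """
    usage = {}
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip blank or unexpected lines
        try:
            mem = float(fields[1])
        except ValueError:
            continue  # memory may be reported as a non-numeric placeholder
        pid = int(fields[0])
        # A process using several GPUs appears once per GPU; add them up.
        usage[pid] = usage.get(pid, 0.0) + mem
    return usage


def gpu_memory(pid=None):
    """Return MiB used by pid (default: this process), or 0.0 if not listed."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    target = os.getpid() if pid is None else pid
    return parse_compute_apps(out).get(target, 0.0)


# Hypothetical sample output, for illustration only (no GPU needed here).
sample = "4242, 211\n5151, 1024\n"
print(parse_compute_apps(sample))
```

    Keeping the parsing separate from the nvidia-smi call means the parser can be checked on a machine without a GPU.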
    

    Also, for speed Tensorflow grabs all available GPU memory. To stop that, you need to add config.gpu_options.allow_growth = True in get_session():

    # keras/backend/tensorflow_backend.py
    def get_session():
        global _SESSION
        if tf.get_default_session() is not None:
            session = tf.get_default_session()
        else:
            if _SESSION is None:
                if not os.environ.get('OMP_NUM_THREADS'):
                    config = tf.ConfigProto(allow_soft_placement=True)
                else:
                    nb_thread = int(os.environ.get('OMP_NUM_THREADS'))
                    config = tf.ConfigProto(intra_op_parallelism_threads=nb_thread,
                                            allow_soft_placement=True)
                # Added line: allocate GPU memory on demand instead of all at once.
                config.gpu_options.allow_growth = True
                _SESSION = tf.Session(config=config)
            session = _SESSION
        if not _MANUAL_VAR_INIT:
            _initialize_variables()
        return session
    

    Now if you run the previous snippet you get plots like these:

    [Two plots of GPU memory against batch size, one for Theano and one for Tensorflow]

    Theano: After model.compile(), whatever memory is in use almost doubles at the start of training. This is because Theano augments the symbolic graph to do back-propagation, and each tensor needs a corresponding tensor to carry the backward flow of gradients. The memory needs don't seem to grow with batch_size, which is unexpected to me, as the placeholder size should increase to accommodate the data flowing from CPU to GPU.

    Tensorflow: No GPU memory is allocated even after model.compile(), as Keras doesn't call get_session(), which is what actually calls _initialize_variables(), until after that point. Tensorflow seems to grab memory in chunks for speed, so the memory doesn't grow linearly with batch_size.

    Having said all that, Tensorflow seems to be memory hungry, but for big graphs it's very fast. Theano, on the other hand, is very GPU-memory efficient but takes a very long time to initialize the graph at the start of training. After that it's also pretty fast.

  • 2021-02-01 02:38

    200M params for a 2 GB GPU is far too much. Your architecture is also not efficient; using local bottlenecks would be more efficient. You should also go from a small model to a big one, not the other way around. Right now your input is 16x16, and with this architecture that means that by the end most of your network will be "zero padded" rather than based on input features. Your model's layers depend on your input, so you can't just set an arbitrary number of layers and sizes: you need to count how much data will be passed to each of them, and understand why you are doing so. I would recommend this free course: http://cs231n.github.io
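
    As a rough illustration of counting what passes through each layer, here is a sketch that works out output sizes and parameter counts by hand. Since the question's own architecture is not shown here, it uses the small convolutional model from the first answer (16x16x1 input, 3x3 convolutions with 'same' padding). The helper names are made up for this example.

```python
def conv_same(shape, filters, kernel):
    """Convolution with 'same' padding: height and width are unchanged."""
    h, w, c = shape
    params = kernel * kernel * c * filters + filters  # weights + biases
    return (h, w, filters), params


def max_pool(shape, pool):
    """Non-overlapping pooling divides height and width by the pool size."""
    h, w, c = shape
    return (h // pool, w // pool, c), 0


def dense(n_in, n_out):
    """Fully connected layer on a flat input of n_in values."""
    return n_out, n_in * n_out + n_out


def summarize(input_shape=(16, 16, 1)):
    """Return (name, output values per sample, parameters) for each layer."""
    rows = []
    shape = input_shape
    for filters in (100, 256, 32):
        shape, params = conv_same(shape, filters, 3)
        rows.append(("conv%d" % filters, shape[0] * shape[1] * shape[2], params))
    shape, params = max_pool(shape, 2)
    flat = shape[0] * shape[1] * shape[2]  # Flatten only reshapes
    rows.append(("maxpool", flat, params))
    for units in (4, 1):
        n_out, params = dense(flat, units)
        rows.append(("dense%d" % units, n_out, params))
        flat = n_out
    return rows


rows = summarize()
for name, values, params in rows:
    print("%-8s %8d values/sample %8d params" % (name, values, params))
total_params = sum(r[2] for r in rows)
total_values = sum(r[1] for r in rows)
print("total: %d params, %d activation values per sample"
      % (total_params, total_values))
```

    The per-sample activation count is multiplied by the batch size during training, while the parameter count is not.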

  • 2021-02-01 02:47

    It is a very common mistake to forget that the activations, gradients and the optimizer's moment-tracking variables also take VRAM, not just the parameters, which increases memory usage quite a bit. The backprop calculations themselves mean the training phase takes almost double the VRAM of forward / inference use of the neural net, and the Adam optimizer triples the space usage.

    So, in the beginning, when the network is created, only the parameters are allocated. However, when training starts, the model activations, backprop computations and the optimizer's tracking variables get allocated, increasing memory use by a large factor.
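
    A back-of-the-envelope sketch of that accounting, assuming float32 values, one gradient per parameter and two extra values per parameter for Adam (its first and second moment estimates). The function name is made up, and real frameworks add workspace and allocator overhead on top, so treat the numbers as a lower bound.

```python
def training_bytes(n_params, optimizer_slots, bytes_per_value=4):
    """Bytes needed for weights, gradients and per-parameter optimizer state.

    optimizer_slots is the number of extra values the optimizer keeps per
    parameter: 0 for plain SGD, 1 for SGD with momentum, 2 for Adam.
    Activations are not included; they scale with batch size.
    """
    copies = 1 + 1 + optimizer_slots  # weights + gradients + optimizer state
    return n_params * bytes_per_value * copies


n_params = 200 * 10**6  # the 200M figure quoted earlier in this thread
for name, slots in (("sgd", 0), ("momentum", 1), ("adam", 2)):
    gb = training_bytes(n_params, slots) / 1e9
    print("%-8s %.1f GB before activations" % (name, gb))
```

    Under these assumptions, 200M parameters already need 1.6 GB for weights and gradients alone with plain SGD, which leaves very little of a 2 GB card for activations.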

    To allow the training of larger models, people:

    • use model parallelism to spread the weights and computations over different accelerators;
    • use gradient checkpointing, which trades extra computation for lower memory use during back-propagation;
    • potentially use a memory-efficient optimizer that aims to reduce the number of tracking variables, such as Adafactor, for which you will find implementations in all popular deep learning frameworks.
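
    To make the gradient checkpointing trade-off concrete, here is a toy sketch in plain Python. It is not how any particular framework implements it. It assumes a chain of scalar layers x -> tanh(w * x) with the final output as the loss, keeps only the input of every k-th layer during the forward pass, and recomputes the rest one segment at a time during the backward pass. All names are made up for this example.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def forward_all(ws, x0):
    """Run the layers and keep every activation (len(ws) + 1 values)."""
    xs = [x0]
    for w in ws:
        xs.append(layer(w, xs[-1]))
    return xs


def backward_segment(ws, xs, grad_out):
    """Backprop through ws given its activations xs; returns (grad_in, grads_w)."""
    grads_w = [0.0] * len(ws)
    g = grad_out
    for i in reversed(range(len(ws))):
        local = 1.0 - xs[i + 1] ** 2  # tanh'(z) expressed via the output
        grads_w[i] = g * local * xs[i]
        g = g * local * ws[i]
    return g, grads_w


def grads_plain(ws, x0):
    xs = forward_all(ws, x0)
    _, grads_w = backward_segment(ws, xs, 1.0)
    return grads_w, len(xs)  # gradients, number of stored activations


def grads_checkpointed(ws, x0, k):
    # Forward pass: store only the input of every k-th layer.
    checkpoints = {}
    x = x0
    for i, w in enumerate(ws):
        if i % k == 0:
            checkpoints[i] = x
        x = layer(w, x)
    # Backward pass: recompute one segment at a time, last segment first.
    grads_w = [0.0] * len(ws)
    g = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg = ws[start:start + k]
        xs = forward_all(seg, checkpoints[start])  # the extra computation
        peak = max(peak, len(checkpoints) + len(xs) - 1)
        g, seg_grads = backward_segment(seg, xs, g)
        grads_w[start:start + len(seg)] = seg_grads
    return grads_w, peak


ws = [0.5 + 0.1 * i for i in range(12)]
plain, n_plain = grads_plain(ws, 0.3)
ckpt, n_ckpt = grads_checkpointed(ws, 0.3, k=4)
print("stored activations: plain=%d checkpointed=%d" % (n_plain, n_ckpt))
print("max gradient difference:", max(abs(a - b) for a, b in zip(plain, ckpt)))
```

    The gradients match, while the checkpointed version holds fewer activations at once, at the cost of running most layers forward a second time.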

    Tools to train very large models:

    • Mesh-Tensorflow https://arxiv.org/abs/1811.02084 https://github.com/tensorflow/mesh
    • Microsoft DeepSpeed: https://github.com/microsoft/DeepSpeed https://www.deepspeed.ai/
    • Facebook FairScale: https://github.com/facebookresearch/fairscale
    • Megatron-LM: https://arxiv.org/abs/1909.08053 https://github.com/NVIDIA/Megatron-LM
    • Article on integration in HuggingFace Transformers: https://huggingface.co/blog/zero-deepspeed-fairscale