I've been messing with Keras, and like it so far. There's one big issue I have been having when working with fairly deep networks: When calling model.train_on_batch, or model
It is a very common mistake to forget that the activations, gradients, and the optimizer's moment-tracking variables also take VRAM, not just the parameters, which increases memory usage quite a bit. The backprop calculations alone make the training phase take almost double the VRAM of forward-only (inference) use of the neural net, and the Adam optimizer roughly triples the space taken by the parameters, since it keeps two moment estimates for every parameter.
So, in the beginning, when the network is created, only the parameters are allocated. However, when training starts, the model activations, backprop computations, and the optimizer's tracking variables get allocated, increasing memory use by a large factor.
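To put rough numbers on this, here is a back-of-the-envelope sketch in plain Python. It is not a Keras API: the function name, its arguments, and the example figures (10M parameters, 2M stored activation values per sample, batch size 32, float32 values) are all made up for illustration. It also ignores framework overhead, temporary buffers, and memory fragmentation, so treat the result as a lower bound rather than a prediction.

```python
def estimate_training_memory_bytes(num_params,
                                   activation_values_per_sample,
                                   batch_size,
                                   bytes_per_value=4,
                                   optimizer_slots=2):
    """Rough lower bound on training memory (hypothetical helper).

    optimizer_slots is the number of extra values the optimizer keeps
    per parameter: 2 for Adam (first and second moment), 0 for plain SGD.
    """
    weights = num_params * bytes_per_value
    # One gradient value per parameter during backprop.
    gradients = num_params * bytes_per_value
    # Optimizer tracking variables, e.g. Adam's two moment estimates.
    optimizer_state = optimizer_slots * num_params * bytes_per_value
    # Activations kept for the backward pass scale with the batch size.
    activations = activation_values_per_sample * batch_size * bytes_per_value
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer_state": optimizer_state,
        "activations": activations,
        "total": weights + gradients + optimizer_state + activations,
    }


# Assumed example figures, chosen only to show the proportions.
estimate = estimate_training_memory_bytes(
    num_params=10_000_000,
    activation_values_per_sample=2_000_000,
    batch_size=32,
)
for name, size in estimate.items():
    print(f"{name}: {size / 1e6:.0f} MB")
```

With numbers like these, the weights are a small fraction of the total, which is why a model that builds without trouble can still run out of memory on the first training batch. The activation term is the only one that shrinks when the batch size is reduced.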
To allow the training of larger models, people:
Tools to train very large models: