It might happen for a number of reasons, which I try to summarise in the following list:
- Module parameters: check the dimensions of your modules. A linear layer that transforms a big input tensor (e.g., size 1000) into another big output tensor (e.g., size 1000) will require a weight matrix of size (1000, 1000); a rough parameter count is sketched after this list.
- RNN decoder maximum steps: if you're using an RNN decoder in your architecture, avoid looping for a large number of steps. Usually, you fix a maximum number of decoding steps that is reasonable for your dataset (see the capped decoding loop sketched below).
- Tensor usage: minimise the number of tensors that you create. The garbage collector won't release them until they go out of scope (see the loss-accumulation sketch below).
- Batch size: incrementally increase your batch size until you run out of memory. It's a common trick that even famous libraries implement (see the `biggest_batch_first` description for the BucketIterator in AllenNLP); a minimal probe is sketched below.
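As a rough illustration of the first point, here is a small sketch (the sizes are made up) of how quickly a single `nn.Linear(1000, 1000)` adds up:

```python
import torch.nn as nn

# A single Linear layer mapping a 1000-dim input to a 1000-dim output
# already holds a (1000, 1000) weight matrix plus 1000 biases.
layer = nn.Linear(1000, 1000)
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 1001000 -> ~4 MB in float32, before gradients and optimizer state
```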
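For the decoder point, a minimal sketch of a decoding loop with a hard cap, assuming a toy GRU decoder (`vocab_size`, `hidden_size`, `max_decoding_steps` and `sos_idx` are made-up values, not from your setup):

```python
import torch
import torch.nn as nn

# Bound the loop by max_decoding_steps instead of looping until an
# end-of-sequence token happens to be produced.
vocab_size, hidden_size, max_decoding_steps, sos_idx = 5000, 256, 50, 1

embedding = nn.Embedding(vocab_size, hidden_size)
cell = nn.GRUCell(hidden_size, hidden_size)
output_proj = nn.Linear(hidden_size, vocab_size)

hidden = torch.zeros(1, hidden_size)    # would normally come from the encoder
token = torch.tensor([sos_idx])
predictions = []
for _ in range(max_decoding_steps):     # hard upper bound on memory and compute
    hidden = cell(embedding(token), hidden)
    token = output_proj(hidden).argmax(dim=-1)
    predictions.append(token.item())
```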
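For tensor usage, a minimal sketch (with a toy model) of the accumulation pattern also discussed in the linked FAQ; the key line is the `.item()` call:

```python
import torch
import torch.nn as nn

# Store Python numbers, not tensors, when accumulating statistics,
# so the computation graph behind each loss can be freed.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

total_loss = 0.0
for _ in range(100):
    inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    total_loss += loss.item()  # `total_loss += loss` would keep every graph alive
```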
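For the batch-size point, a rough probe you could adapt (it assumes a CUDA device and uses a made-up model; it is not AllenNLP's actual implementation):

```python
import torch
import torch.nn as nn

# Keep doubling the batch size until an out-of-memory error is raised,
# then keep the last size that worked.
model = nn.Linear(1000, 1000).cuda()
batch_size, last_good = 1, None
while True:
    try:
        x = torch.randn(batch_size, 1000, device="cuda")
        model(x).sum().backward()
        last_good = batch_size
        batch_size *= 2
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        torch.cuda.empty_cache()
        break
print(f"largest batch size that fits: {last_good}")
```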
In addition, I would recommend you have a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html