How to fix this strange error: “RuntimeError: CUDA error: out of memory”

时光取名叫无心 2021-02-12 22:59

I ran code for a deep learning network. First I trained the network and it worked well, but this error occurs when I run the validation pass.

I train for five epochs.

6 answers
  • 2021-02-12 23:41

    It might happen for a number of reasons, which I try to cover in the following list:

    1. Module parameters: check the number of dimensions of your modules. Linear layers that transform a big input tensor (e.g., size 1000) into another big output tensor (e.g., size 1000) will require a matrix whose size is (1000, 1000).
    2. RNN decoder maximum steps: if you're using an RNN decoder in your architecture, avoid looping for a big number of steps. Usually, you fix a given number of decoding steps that is reasonable for your dataset.
    3. Tensors usage: minimise the number of tensors that you create. The garbage collector won't release them until they go out of scope.
    4. Batch size: incrementally increase your batch size until you go out of memory. It's a common trick that even well-known libraries implement (see the biggest_batch_first description for the BucketIterator in AllenNLP); a short sketch of this trick follows below.

    In addition, I would recommend having a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html
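
    A minimal sketch of the batch-size probing trick from point 4, assuming a generic PyTorch model; the names find_max_batch_size and make_batch are placeholders, not an existing API:

    import torch

    def find_max_batch_size(model, make_batch, start=1, limit=2**16, device="cuda"):
        """Double the batch size until CUDA runs out of memory and
        return the last size that fitted."""
        model = model.to(device)
        batch_size, last_ok = start, start
        while batch_size <= limit:
            try:
                x = make_batch(batch_size).to(device)
                with torch.no_grad():
                    model(x)                      # forward pass only, no graph kept
                last_ok = batch_size
                batch_size *= 2
            except RuntimeError as e:             # CUDA OOM surfaces as a RuntimeError
                if "out of memory" in str(e):
                    torch.cuda.empty_cache()
                    break
                raise
        return last_ok

    # Example: probe a small linear model with random inputs
    model = torch.nn.Linear(1000, 1000)
    print(find_max_batch_size(model, lambda n: torch.randn(n, 1000)))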

  • 2021-02-12 23:45

    The best way is to find the process occupying GPU memory and kill it:

    Find the PID of the Python process with:

    nvidia-smi
    

    Copy the PID and kill it with:

    sudo kill -9 pid
    
  • 2021-02-12 23:49

    1. When you only perform validation, not training, you don't need to calculate gradients for the forward and backward passes. In that situation, your code can be placed under:

    with torch.no_grad():                    # gradient tracking is disabled in this block
        ...
        net = Net()
        pred_for_validation = net(input)     # forward pass only; no computation graph is stored
        ...
    

    The code above doesn't build a computation graph, so it uses far less GPU memory during validation.

    2. If you use the += operator on a tensor in your code, it can keep accumulating the gradient graph across iterations. In that case, you need to use float() as described here:
    https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory

    Although the docs suggest float(), in my case item() also worked:

    entire_loss = 0.0
    for i in range(100):
        one_loss = loss_function(prediction, label)
        entire_loss += one_loss.item()       # .item() returns a Python float detached from the graph
    

    3. If you use a for loop in your training code, tensors created inside it can stay alive until the entire loop ends. In that case, you can explicitly delete intermediate variables after optimizer.step():

    for one_epoch in range(100):
        ...
        optimizer.step()
        del intermediate_variable1, intermediate_variable2, ...
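
    Putting the three points together, a minimal validation-loop sketch with a tiny placeholder model and synthetic data (none of these names come from the question):

    import torch
    from torch import nn

    # Placeholder model and data; substitute your own network and validation loader.
    net = nn.Linear(10, 2)
    val_loader = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(5)]
    loss_function = nn.CrossEntropyLoss()

    net.eval()                            # switch layers like dropout/batch-norm to eval mode
    total_loss = 0.0
    with torch.no_grad():                 # point 1: no computation graph during validation
        for inputs, labels in val_loader:
            preds = net(inputs)
            loss = loss_function(preds, labels)
            total_loss += loss.item()     # point 2: accumulate a Python float, not a tensor
            del preds, loss               # point 3: free intermediates before the next batch

    print(total_loss / len(val_loader))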
    
  • 2021-02-12 23:58

    I faced the same issue with my computer. All you have to do is customize your cfg file so that it suits your machine. It turned out my computer can only handle image sizes below 600 x 600, and when I adjusted that in the config file, the program ran smoothly. (The original answer included a picture of the cfg file.)
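
    The answer doesn't name the framework behind the cfg file; as a rough equivalent in plain PyTorch/torchvision, reducing the input resolution in the preprocessing pipeline also cuts activation memory (the sizes below are only illustrative):

    from torchvision import transforms

    # Smaller input images mean smaller activations and less GPU memory.
    preprocess = transforms.Compose([
        transforms.Resize((320, 320)),    # illustrative: down from a larger size such as 600 x 600
        transforms.ToTensor(),
    ])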

  • 2021-02-13 00:00

    If someone arrives here because of fast.ai, the batch size of a loader such as ImageDataLoaders can be controlled via bs=N where N is the size of the batch.

    My dedicated GPU is limited to 2 GB of memory; using bs=8 in the following example worked in my situation:

    from fastai.vision.all import *
    path = untar_data(URLs.PETS)/'images'
    
    def is_cat(x): return x[0].isupper()
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(244), num_workers=0, bs=8)
    
    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    learn.fine_tune(1)
    
  • 2021-02-13 00:04

    The error you have provided is shown because you ran out of memory on your GPU. A way to solve it is to reduce the batch size until your code runs without this error.
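
    In a plain PyTorch pipeline the batch size is usually set on the DataLoader; a minimal sketch with a placeholder dataset (the value 16 is only an example to shrink until the error disappears):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder dataset; substitute your own.
    dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

    # Halve batch_size until the out-of-memory error goes away.
    train_loader = DataLoader(dataset, batch_size=16, shuffle=True)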
