How to fix this strange error: “RuntimeError: CUDA error: out of memory”

时光取名叫无心 2021-02-12 22:59

I ran code for a deep learning network. First I trained the network and it worked well, but this error occurs when I run the validation pass.

I train for five epochs.

6 answers
  • 2021-02-12 23:41

    It might happen for a number of reasons, which I try to cover in the following list:

    1. Module parameters: check the number of dimensions of your modules. Linear layers that transform a big input tensor (e.g., size 1000) into another big output tensor (e.g., size 1000) will require a matrix whose size is (1000, 1000).
    2. RNN decoder maximum steps: if you're using an RNN decoder in your architecture, avoid looping for a big number of steps. Usually, you fix a given number of decoding steps that is reasonable for your dataset.
    3. Tensors usage: minimise the number of tensors that you create. The garbage collector won't release them until they go out of scope.
    4. Batch size: incrementally increase your batch size until you go out of memory. It's a common trick that even well-known libraries implement (see the biggest_batch_first description for the BucketIterator in AllenNLP); a short sketch of this trick follows below.

    In addition, I would recommend having a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html
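
    A minimal sketch of the batch-size probing trick from point 4, assuming a generic PyTorch model; the names find_max_batch_size and make_batch are placeholders, not an existing API:

    import torch

    def find_max_batch_size(model, make_batch, start=1, limit=2**16, device="cuda"):
        """Double the batch size until CUDA runs out of memory and
        return the last size that fitted."""
        model = model.to(device)
        batch_size, last_ok = start, start
        while batch_size <= limit:
            try:
                x = make_batch(batch_size).to(device)
                with torch.no_grad():
                    model(x)                      # forward pass only, no graph kept
                last_ok = batch_size
                batch_size *= 2
            except RuntimeError as e:             # CUDA OOM surfaces as a RuntimeError
                if "out of memory" in str(e):
                    torch.cuda.empty_cache()
                    break
                raise
        return last_ok

    # Example: probe a small linear model with random inputs
    model = torch.nn.Linear(1000, 1000)
    print(find_max_batch_size(model, lambda n: torch.randn(n, 1000)))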

  • 2021-02-12 23:45

    The best way is to find the process occupying GPU memory and kill it:

    Find the PID of the Python process with:

    nvidia-smi
    

    Copy the PID and kill it with:

    sudo kill -9 pid
    
  • 2021-02-12 23:49

    1. When you only perform validation, not training, you don't need to calculate gradients for the forward and backward passes. In that situation, your code can be placed under:

    with torch.no_grad():                    # gradient tracking is disabled in this block
        ...
        net = Net()
        pred_for_validation = net(input)     # forward pass only; no computation graph is stored
        ...
    

    The code above doesn't build a computation graph, so it uses far less GPU memory during validation.

    2. If you use the += operator on a tensor in your code, it can keep accumulating the gradient graph across iterations. In that case, you need to use float() as described here:
    https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory

    Although the docs suggest float(), in my case item() also worked:

    entire_loss = 0.0
    for i in range(100):
        one_loss = loss_function(prediction, label)
        entire_loss += one_loss.item()       # .item() returns a Python float detached from the graph
    

    3. If you use a for loop in your training code, tensors created inside it can stay alive until the entire loop ends. In that case, you can explicitly delete intermediate variables after optimizer.step():

    for one_epoch in range(100):
        ...
        optimizer.step()
        del intermediate_variable1, intermediate_variable2, ...
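
    Putting the three points together, a minimal validation-loop sketch with a tiny placeholder model and synthetic data (none of these names come from the question):

    import torch
    from torch import nn

    # Placeholder model and data; substitute your own network and validation loader.
    net = nn.Linear(10, 2)
    val_loader = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(5)]
    loss_function = nn.CrossEntropyLoss()

    net.eval()                            # switch layers like dropout/batch-norm to eval mode
    total_loss = 0.0
    with torch.no_grad():                 # point 1: no computation graph during validation
        for inputs, labels in val_loader:
            preds = net(inputs)
            loss = loss_function(preds, labels)
            total_loss += loss.item()     # point 2: accumulate a Python float, not a tensor
            del preds, loss               # point 3: free intermediates before the next batch

    print(total_loss / len(val_loader))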
    
  • 2021-02-12 23:58

    I faced the same issue with my computer. All you have to do is customize your cfg file so that it suits your machine. It turned out my computer can only handle image sizes below 600 x 600, and when I adjusted that in the config file, the program ran smoothly. (The original answer included a picture of the cfg file.)
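
    The answer doesn't name the framework behind the cfg file; as a rough equivalent in plain PyTorch/torchvision, reducing the input resolution in the preprocessing pipeline also cuts activation memory (the sizes below are only illustrative):

    from torchvision import transforms

    # Smaller input images mean smaller activations and less GPU memory.
    preprocess = transforms.Compose([
        transforms.Resize((320, 320)),    # illustrative: down from a larger size such as 600 x 600
        transforms.ToTensor(),
    ])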

  • 2021-02-13 00:00

    If someone arrives here because of fast.ai, the batch size of a loader such as ImageDataLoaders can be controlled via bs=N where N is the size of the batch.

    My dedicated GPU is limited to 2 GB of memory; using bs=8 in the following example worked in my situation:

    from fastai.vision.all import *
    path = untar_data(URLs.PETS)/'images'
    
    def is_cat(x): return x[0].isupper()
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(244), num_workers=0, bs=8)
    
    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    learn.fine_tune(1)
    
  • 2021-02-13 00:04

    The error you have provided is shown because you ran out of memory on your GPU. A way to solve it is to reduce the batch size until your code runs without this error.
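
    In a plain PyTorch pipeline the batch size is usually set on the DataLoader; a minimal sketch with a placeholder dataset (the value 16 is only an example to shrink until the error disappears):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder dataset; substitute your own.
    dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

    # Halve batch_size until the out-of-memory error goes away.
    train_loader = DataLoader(dataset, batch_size=16, shuffle=True)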
