Why do we need to call zero_grad() in PyTorch?

清歌不尽 2020-12-02 05:03

The method zero_grad() needs to be called during training, but the documentation is not very helpful:

|  zero_grad(self)
|      Sets gradients of all model parameters to zero.
2 Answers
  • 2020-12-02 05:11

    zero_grad() restarts each loop iteration without the gradients from the previous step, which is what you want when you use a gradient method to decrease the error (or loss).

    If you don't call zero_grad(), the loss will increase rather than decrease as required, because every step is taken with the gradients accumulated from all previous steps.

    For example, if you use zero_grad() you will see output like the following:

    model training loss is 1.5
    model training loss is 1.4
    model training loss is 1.3
    model training loss is 1.2
    

    If you don't use zero_grad(), you will see output like the following:

    model training loss is 1.4
    model training loss is 1.9
    model training loss is 2
    model training loss is 2.8
    model training loss is 3.5
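
    If you want to reproduce this comparison yourself, here is a minimal sketch; the toy data, the single weight w, and the learning rate below are made up for illustration, not taken from the runs above:

    import torch

    toy_x = torch.randn(64, 1)
    toy_y = 3.0 * toy_x                # the model should learn a weight close to 3

    for use_zero_grad in (True, False):
        w = torch.zeros(1, requires_grad=True)
        optimizer = torch.optim.SGD([w], lr=0.1)
        print("use_zero_grad =", use_zero_grad)
        for step in range(10):
            if use_zero_grad:
                optimizer.zero_grad()  # drop the gradients of the last step
            loss = ((toy_x * w - toy_y) ** 2).mean()
            loss.backward()            # adds into w.grad
            optimizer.step()
            print("model training loss is", round(loss.item(), 2))

    With zero_grad() the loss shrinks steadily; without it every step uses the sum of all past gradients, so the updates overshoot and the loss stops decreasing steadily.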
    
  • 2020-12-02 05:23

    In PyTorch, we need to set the gradients to zero before starting to do backpropagation because PyTorch accumulates the gradients on subsequent backward passes. This is convenient while training RNNs. So, the default action is to accumulate (i.e. sum) the gradients on every loss.backward() call.
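
    For instance, here is a minimal sketch (not part of the original answer) showing the accumulation directly:

    import torch

    x = torch.ones(3, requires_grad=True)

    (x * 2).sum().backward()
    print(x.grad)    # tensor([2., 2., 2.])

    (x * 2).sum().backward()
    print(x.grad)    # tensor([4., 4., 4.]) -- summed, not overwritten

    x.grad.zero_()   # this manual reset is what optimizer.zero_grad() does for you
    print(x.grad)    # tensor([0., 0., 0.])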

    Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the accumulated gradient would point in some direction other than the intended direction towards the minimum (or maximum, in case of maximization objectives).

    Here is a simple example:

    import torch
    from torch.autograd import Variable
    import torch.optim as optim
    
    def linear_model(x, W, b):
        return torch.matmul(x, W) + b
    
    data, targets = ...
    
    W = Variable(torch.randn(4, 3), requires_grad=True)
    b = Variable(torch.randn(3), requires_grad=True)
    
    optimizer = optim.Adam([W, b])
    
    for sample, target in zip(data, targets):
        # clear out the gradients of all Variables 
        # in this optimizer (i.e. W, b)
        optimizer.zero_grad()
        output = linear_model(sample, W, b)
        loss = ((output - target) ** 2).sum()  # sum() so the loss is a scalar for backward()
        loss.backward()
        optimizer.step()
    

    Alternatively, if you're doing vanilla gradient descent yourself, then:

    learning_rate = 0.01  # hypothetical value; the original snippet left learning_rate undefined

    W = Variable(torch.randn(4, 3), requires_grad=True)
    b = Variable(torch.randn(3), requires_grad=True)

    for sample, target in zip(data, targets):
        # clear out the gradients of the Variables (i.e. W, b);
        # .grad is None until the first backward() call, hence the guard
        if W.grad is not None:
            W.grad.data.zero_()
        if b.grad is not None:
            b.grad.data.zero_()

        output = linear_model(sample, W, b)
        loss = ((output - target) ** 2).sum()  # scalar loss for backward()
        loss.backward()

        # update .data directly so autograd does not track the parameter update
        W.data -= learning_rate * W.grad.data
        b.data -= learning_rate * b.grad.data
    

    Note: The accumulation (i.e. sum) of gradients happens when .backward() is called on the loss tensor.
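
    As a side note, this accumulating default is sometimes used on purpose: you can call backward() on several mini-batches and only step and zero once per group, which behaves like a larger batch. A rough sketch, with a made-up model, data, and accumulation factor:

    import torch

    model = torch.nn.Linear(4, 3)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    data = torch.randn(8, 4)
    targets = torch.randn(8, 3)
    accum_steps = 4

    optimizer.zero_grad()
    for i, (sample, target) in enumerate(zip(data, targets)):
        loss = ((model(sample) - target) ** 2).sum()
        loss.backward()                 # gradients are summed into .grad here
        if (i + 1) % accum_steps == 0:
            optimizer.step()            # one update built from several backward() calls
            optimizer.zero_grad()       # reset before the next accumulation window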
