Lack of Sparse Solution with L1 Regularization in Pytorch


Question


I am trying to apply L1 regularization to the first layer of a simple neural network (one hidden layer). I looked into other StackOverflow posts that apply L1 regularization with PyTorch to figure out how it should be done (references: Adding L1/L2 regularization in PyTorch?, In Pytorch, how to add L1 regularizer to activations?). No matter how high I increase lambda (the L1 regularization strength parameter), I do not get true zeros in the first weight matrix. Why would this be? (Code is below.)

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class Network(nn.Module):
    def __init__(self,nf,nh,nc):
        super(Network,self).__init__()
        self.lin1=nn.Linear(nf,nh)
        self.lin2=nn.Linear(nh,nc)

    def forward(self,x):
        l1out=F.relu(self.lin1(x))
        out=F.softmax(self.lin2(l1out))
        return out, l1out

def l1loss(layer):
    return torch.norm(layer.weight.data, p=1)

nf=10
nc=2
nh=6
learningrate=0.02
lmbda=10.
batchsize=50

net=Network(nf,nh,nc)

crit=nn.MSELoss()
optimizer=torch.optim.Adagrad(net.parameters(),lr=learningrate)


xtr=torch.Tensor(xtr)
ytr=torch.Tensor(ytr)
#ytr=torch.LongTensor(ytr)
xte=torch.Tensor(xte)
yte=torch.LongTensor(yte)
#cyte=torch.Tensor(yte)

it=200
for epoch in range(it):
    per=torch.randperm(len(xtr))
    for i in range(0,len(xtr),batchsize):
        ind=per[i:i+batchsize]
        bx,by=xtr[ind],ytr[ind]            
        optimizer.zero_grad()
        output, l1out=net(bx)
#        l1reg=l1loss(net.lin1)    
        loss=crit(output,by)+lmbda*l1loss(net.lin1)
        loss.backward()
        optimizer.step()
    print('Epoch [%i/%i], Loss: %.4f' %(epoch+1,it, np.float32(loss.data.numpy())))

corr=0
tot=0
for x,y in list(zip(xte,yte)):
    output,_=net(x)
    _,pred=torch.max(output,-1)
    tot+=1 #y.size(0)
    corr+=(pred==y).sum()
print(corr)

Note: The data has 10 features (2 classes and 800 training samples), and only the first two features are relevant (by design), so one would assume true zeros should be easy enough to learn.
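
The data-generation step is not included in the question, so xtr, ytr, xte and yte are undefined in the snippet above. A hypothetical setup consistent with the note (the variable names, the label rule, and the one-hot encoding of ytr are assumptions, not part of the original post) could be:

import numpy as np

# Hypothetical data: 800 training samples, 10 features, 2 classes,
# with the class determined only by the first two features.
rng = np.random.RandomState(0)

def make_data(n):
    x = rng.randn(n, 10).astype(np.float32)
    labels = (x[:, 0] + x[:, 1] > 0).astype(np.int64)  # only features 0 and 1 matter
    return x, labels

xtr, ytr_idx = make_data(800)
xte, yte = make_data(200)
# One-hot training targets, since the training loop pairs MSELoss with a softmax output.
ytr = np.eye(2, dtype=np.float32)[ytr_idx]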


Answer 1:


Your use of layer.weight.data detaches the parameter (a PyTorch variable) from its automatic differentiation context, so the optimizer treats it as a constant when computing gradients. As a result, the regularization term contributes zero gradient and the L1 loss is effectively never optimized.

If you remove the .data, the norm is computed on the PyTorch variable itself and the gradients will be correct.
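
Concretely, the fix is to compute the norm on layer.weight rather than layer.weight.data; a minimal sketch of the corrected regularizer from the question:

def l1loss(layer):
    # layer.weight (without .data) stays in the autograd graph, so the
    # L1 penalty produces gradients when loss.backward() is called.
    return torch.norm(layer.weight, p=1)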

For more information on PyTorch's automatic differentiation mechanics, see the autograd section of the PyTorch documentation or the official autograd tutorial.
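
A small standalone check (the tensor here is an arbitrary example, not from the question) shows the difference: a norm taken on .data does not track gradients, while a norm on the parameter itself does.

import torch

w = torch.randn(4, requires_grad=True)

# Norm of w.data is detached from the graph and acts as a constant in the loss.
print(torch.norm(w.data, p=1).requires_grad)  # False

# Norm of w itself is part of the graph; backward() yields the L1 subgradient.
loss = torch.norm(w, p=1)
loss.backward()
print(w.grad)  # sign(w)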



Source: https://stackoverflow.com/questions/50054049/lack-of-sparse-solution-with-l1-regularization-in-pytorch
