Question
I am trying to apply L1 regularization to the first layer of a simple neural network (one hidden layer). I looked at other StackOverflow posts that apply L1 regularization in PyTorch to work out how it should be done (references: Adding L1/L2 regularization in PyTorch?, In Pytorch, how to add L1 regularizer to activations?). No matter how far I increase lambda (the L1 regularization strength), I do not get true zeros in the first weight matrix. Why would this be? (Code is below.)
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class Network(nn.Module):
    def __init__(self,nf,nh,nc):
        super(Network,self).__init__()
        self.lin1=nn.Linear(nf,nh)
        self.lin2=nn.Linear(nh,nc)

    def forward(self,x):
        l1out=F.relu(self.lin1(x))
        out=F.softmax(self.lin2(l1out))
        return out, l1out

def l1loss(layer):
    return torch.norm(layer.weight.data, p=1)

nf=10
nc=2
nh=6
learningrate=0.02
lmbda=10.
batchsize=50

net=Network(nf,nh,nc)
crit=nn.MSELoss()
optimizer=torch.optim.Adagrad(net.parameters(),lr=learningrate)

xtr=torch.Tensor(xtr)
ytr=torch.Tensor(ytr)
#ytr=torch.LongTensor(ytr)
xte=torch.Tensor(xte)
yte=torch.LongTensor(yte)
#cyte=torch.Tensor(yte)

it=200
for epoch in range(it):
    per=torch.randperm(len(xtr))
    for i in range(0,len(xtr),batchsize):
        ind=per[i:i+batchsize]
        bx,by=xtr[ind],ytr[ind]
        optimizer.zero_grad()
        output, l1out=net(bx)
        # l1reg=l1loss(net.lin1)
        loss=crit(output,by)+lmbda*l1loss(net.lin1)
        loss.backward()
        optimizer.step()
    print('Epoch [%i/%i], Loss: %.4f' %(epoch+1,it, np.float32(loss.data.numpy())))

corr=0
tot=0
for x,y in list(zip(xte,yte)):
    output,_=net(x)
    _,pred=torch.max(output,-1)
    tot+=1 #y.size(0)
    corr+=(pred==y).sum()
print(corr)
Note: The data has 10 features, 2 classes, and 800 training samples. Only the first 2 features are relevant (by design), so one would assume true zeros should be easy enough to learn.
Answer 1:
Your use of layer.weight.data removes the parameter (which is a PyTorch variable) from its automatic differentiation context, making it a constant when the gradients are computed. As a result, the L1 term contributes zero gradient: its value is added to the loss, but it has no effect on the weight updates.
If you remove the .data, the norm is computed on the PyTorch variable itself and the gradients should be correct.
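The difference can be checked in isolation. The following is a minimal sketch, not the asker's full training script: `layer` is a stand-in for `net.lin1`, and the helper names `l1_detached` and `l1_tracked` are hypothetical, introduced here only to contrast the two versions of `l1loss`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(10, 6)  # stand-in for net.lin1 (assumed sizes nf=10, nh=6)

def l1_detached(layer):
    # Version from the question: .data bypasses autograd,
    # so the result is a constant with no gradient history.
    return torch.norm(layer.weight.data, p=1)

def l1_tracked(layer):
    # Version with .data removed: the norm stays in the autograd graph.
    return torch.norm(layer.weight, p=1)

detached = l1_detached(layer)
tracked = l1_tracked(layer)

# Both compute the same value; only gradient tracking differs.
print(detached.requires_grad, tracked.requires_grad)

# Backpropagating the tracked penalty fills layer.weight.grad with
# the subgradient of the L1 norm, i.e. the sign of each weight.
tracked.backward()
grad_is_sign = torch.allclose(layer.weight.grad, layer.weight.detach().sign())
print(grad_is_sign)
```

In the question's training loop, the corresponding change would be to have `l1loss` return `torch.norm(layer.weight, p=1)`, so that `lmbda*l1loss(net.lin1)` actually influences `loss.backward()`.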
For more information on PyTorch's automatic differentiation mechanics, see this docs article or this tutorial.
Source: https://stackoverflow.com/questions/50054049/lack-of-sparse-solution-with-l1-regularization-in-pytorch