The basic principles were covered in the post on linear regression; to prevent overfitting, a regularization term is often added to the objective. The most commonly used are L1 regularization and L2 regularization.
1. LASSO Regression
Linear regression with an L1 regularization term added is called LASSO regression. The L1 regularization term is the L1 norm of the parameters, that is, the sum of the absolute values of the components of the parameter vector. For the parameter vector \(\theta=(\theta_0, \theta_1, \cdots, \theta_n)^T\), the L1 regularization term is:
\[ \left \| \theta \right \|_1 = \sum_{j=0}^n | \theta_j | \]
A coefficient \(\lambda\) is usually introduced to control the weight of the regularization term, so the objective function (loss function) of LASSO regression is:
\[ J(\theta) = \frac{1}{2}\sum_{i=1}^m(h(x^{(i)})-y^{(i)})^2 + \lambda \sum_{j=0}^n | \theta_j | = \frac{1}{2}\left(X\theta-Y\right)^T\left(X\theta-Y\right) + \lambda\left \| \theta \right \|_1 \]
LASSO regression can drive the coefficients of some features to zero (i.e., some \(\theta_j\) become exactly zero), yielding a sparse solution.
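As a quick illustration of this sparsity effect, here is a minimal sketch using scikit-learn's Lasso on synthetic data (the data and the alpha value are illustrative choices; scikit-learn's alpha plays the role of \(\lambda\), up to a scaling by the sample count):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first 3 of 10 features actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_theta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_theta + 0.1 * rng.normal(size=200)

# L1-penalized fit; scikit-learn's alpha corresponds to the lambda above.
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # coefficients of the 7 irrelevant features come out exactly 0
```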
Since \(|\theta_j|\) is not differentiable at zero, an approximate solution can be sought in practice. Consider the function \(f(x;\alpha) = x + \frac{1}{\alpha}\log(1+\exp(-\alpha x))\), which smoothly approximates \(\max(x, 0)\) (it tends to \(x\) for \(x \ge 0\) and to \(0\) for \(x < 0\) as \(\alpha\) grows). The absolute value can then be approximated as:
\[ \begin{aligned} |x| &\approx f(x;\alpha) + f(-x;\alpha)\\ & = x + \frac{1}{\alpha}\log(1+\exp(-\alpha x)) - x + \frac{1}{\alpha}\log(1+\exp(\alpha x))\\ & = \frac{1}{\alpha}\left(\log(1+\exp(-\alpha x)) + \log(1 + \exp(\alpha x))\right)\\ \end{aligned} \]
Hence the gradient and second derivative of \(|x|\) can be obtained from the approximation above:
\[ \begin{aligned} \nabla |x| &\approx \frac{1}{\alpha}\left( \frac{-\alpha \exp(-\alpha x)}{1+\exp(-\alpha x)} + \frac{\alpha \exp(\alpha x)}{1+\exp(\alpha x)} \right)\\ &=\frac{\exp(\alpha x)}{1+\exp(\alpha x)} - \frac{\exp(-\alpha x)}{1+\exp(-\alpha x)}\\ &=\left( 1 - \frac{1}{1+\exp(\alpha x)} \right) - \left( 1 - \frac{1}{1+\exp(-\alpha x)} \right)\\ & = \frac{1}{1+\exp(-\alpha x)} - \frac{1}{1+\exp(\alpha x)}\\ \nabla^2 |x| &\approx \nabla( \frac{1}{1+\exp(-\alpha x)} - \frac{1}{1+\exp(\alpha x)})\\ &=\frac{\alpha\exp(-\alpha x)}{(1+\exp(-\alpha x))^2} + \frac{\alpha \exp(\alpha x)}{(1+\exp(\alpha x))^2}\\ &=\frac{\alpha\frac{1}{\exp(\alpha x)}}{(1+\frac{1}{\exp(\alpha x)})^2} + \frac{\alpha \exp(\alpha x)}{(1+\exp(\alpha x))^2}\\ &=\frac{\alpha\exp(\alpha x)}{(1+\exp(\alpha x))^2} + \frac{\alpha \exp(\alpha x)}{(1+\exp(\alpha x))^2}\\ &=\frac{2\alpha \exp(\alpha x)}{(1+\exp(\alpha x))^2} \end{aligned} \]
For typical problems, \(\alpha\) is usually set to \(10^6\).
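As a sanity check on the approximation and its gradient, a small numerical sketch (illustrative only; `np.logaddexp` and `scipy.special.expit` are used to evaluate \(\log(1+e^t)\) and \(1/(1+e^{-t})\) without overflow at \(\alpha = 10^6\)):

```python
import numpy as np
from scipy.special import expit

def smooth_abs(x, alpha=1e6):
    # |x| ≈ (1/alpha) * (log(1+exp(-alpha*x)) + log(1+exp(alpha*x)))
    return (np.logaddexp(0.0, -alpha * x) + np.logaddexp(0.0, alpha * x)) / alpha

def smooth_abs_grad(x, alpha=1e6):
    # d|x|/dx ≈ 1/(1+exp(-alpha*x)) - 1/(1+exp(alpha*x))
    return expit(alpha * x) - expit(-alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(smooth_abs(x))       # ≈ [2.0, 0.5, ~1.4e-6, 0.5, 2.0]
print(smooth_abs_grad(x))  # ≈ sign(x): [-1, -1, 0, 1, 1]
```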
Using the approximate gradient, the first derivative of the objective function is:
\[ \frac{\partial J(\theta)}{\partial \theta} \approx X^TX\theta - X^TY + \frac{\lambda}{1+\exp(-\alpha \theta)} - \frac{\lambda}{1+\exp(\alpha \theta)} \]
Clearly, setting this derivative to zero still does not yield a closed-form solution for \(\theta\), so LASSO is usually solved with coordinate descent or least angle regression (LARS).
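For intuition, the smoothed gradient above can also be plugged into plain gradient descent; the following is a toy sketch of that approach (not the coordinate descent or LARS solvers used in practice, and the step size, iteration count, and \(\lambda\) are arbitrary illustrative choices):

```python
import numpy as np
from scipy.special import expit

def lasso_gd(X, Y, lam=5.0, alpha=1e6, lr=1e-3, iters=5000):
    """Gradient descent on the smoothed LASSO objective."""
    theta = np.zeros(X.shape[1])
    XtX, XtY = X.T @ X, X.T @ Y
    for _ in range(iters):
        # dJ/dtheta ≈ X^T X theta - X^T Y + lam*(sigma(alpha*theta) - sigma(-alpha*theta))
        grad = XtX @ theta - XtY + lam * (expit(alpha * theta) - expit(-alpha * theta))
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.05 * rng.normal(size=100)
print(lasso_gd(X, Y))  # near-zero (though not exactly zero) weights on the unused features
```

Because the smoothed penalty is differentiable everywhere, this sketch never produces coefficients that are exactly zero, which is one reason the exact solvers named above are preferred in practice.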
2. Ridge Regression
Ridge regression is linear regression with an L2 regularization term added. The L2 regularization term is the squared L2 norm of the parameters. For the parameter vector \(\theta=(\theta_0, \theta_1, \cdots, \theta_n)^T\), the L2 regularization term is:
\[ \left \| \theta \right \|_2^2 = \sum_{j=0}^n \theta_j^2 \]
The objective function (loss function) of Ridge regression is:
\[ J(\theta) = \frac{1}{2}\sum_{i=1}^m(h(x^{(i)})-y^{(i)})^2 + \frac{\lambda}{2} \sum_{j=0}^n \theta_j^2 = \frac{1}{2}\left(X\theta-Y\right)^T\left(X\theta-Y\right) + \frac{1}{2}\lambda\left \| \theta \right \|_2^2 \]
The factor \(\frac{1}{2}\) on the regularization term is included to simplify the later derivative computation. Ridge regression does not discard any feature (no \(\theta_j\) is driven to exactly zero); instead it shrinks all regression coefficients, which makes the estimates relatively stable, but compared with LASSO regression it retains more features.
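To make the contrast with LASSO concrete, a toy comparison using scikit-learn (the alpha values are arbitrary illustrative choices): Ridge shrinks all coefficients but leaves none exactly zero, while Lasso zeroes out the irrelevant ones.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Same synthetic setup as before: only the first 3 of 10 features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ np.array([3.0, -2.0, 1.5] + [0.0] * 7) + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(ridge.coef_, 3))  # all 10 coefficients shrunk toward zero, none exactly zero
print(np.round(lasso.coef_, 3))  # the 7 irrelevant coefficients are exactly zero
```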
The first derivative of the Ridge regression objective function is:
\[ \begin{aligned} \frac{\partial J(\theta)}{\partial \theta} &= \frac{\partial}{\partial \theta}\left(\frac{1}{2}\left(X\theta-Y\right)^T\left(X\theta-Y\right) + \frac{1}{2}\lambda\left \| \theta \right \|_2^2\right)\\ &=X^TX\theta-X^TY + \frac{\partial}{\partial \theta}\left( \frac{1}{2}\lambda\left \| \theta \right \|_2^2 \right)\\ &=X^TX\theta-X^TY + \frac{\partial}{\partial \theta}\left( \frac{1}{2}\lambda \theta^T\theta \right)\\ &=X^TX\theta-X^TY + \lambda \theta \end{aligned} \]
Setting the derivative to zero yields the value of \(\theta\):
\[ \theta = (X^TX+\lambda I)^{-1}X^TY \]
Notice that this result coincides with the analytical solution with the perturbation term added, as described in the linear regression post.
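As a closing sketch, the closed-form solution can be implemented directly (illustrative code; `np.linalg.solve` is used instead of forming an explicit inverse, and scikit-learn's Ridge with `fit_intercept=False` serves as a cross-check, its alpha playing the role of \(\lambda\)):

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(X, Y, lam=1.0):
    # theta = (X^T X + lam * I)^{-1} X^T Y, solved as a linear system
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

theta = ridge_closed_form(X, Y, lam=1.0)
sk = Ridge(alpha=1.0, fit_intercept=False).fit(X, Y)
print(theta)     # closed-form solution
print(sk.coef_)  # matches the closed-form result
```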