ML Note 1.3 - SVM


Support vector machines are among the best supervised learning algorithms. For a linear binary classification problem, let $y \in \{-1,1\}$ and $x \in \mathbb{R}^n$. Note that we do not use the augmented feature vector here; instead, the linear classifier is parameterized by $w$ and $b$:
$$h_{w,b}(x) = g(w^Tx+b)$$

where
$$g(z) = \left\{\begin{array}{cl} 1 & \text{if }z \ge 0\\ -1 & \text{otherwise} \end{array}\right.$$
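As a quick illustration, here is a minimal NumPy sketch of this decision rule; the values of `w`, `b`, and `x` are placeholders, not parameters produced by any training procedure.

```python
import numpy as np

def g(z):
    """Threshold used by the SVM: +1 if z >= 0, otherwise -1."""
    return np.where(z >= 0, 1, -1)

def h(w, b, x):
    """Linear SVM hypothesis h_{w,b}(x) = g(w^T x + b)."""
    return g(x @ w + b)

# toy example with placeholder parameters
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])
print(h(w, b, x))  # -1, since 2*1 - 1*3 + 0.5 = -0.5 < 0
```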

Optimal Margin Classifier

Assume the training set $S$ is linearly separable, i.e. there exist $w, b$ that classify every sample correctly. The absolute value of $w^Tx+b$ can be viewed as the confidence of a single prediction. Define the functional margin
$$\hat\gamma^{(i)} = y^{(i)}(w^Tx^{(i)}+b)$$

If $\hat\gamma^{(i)} > 0$, the prediction is correct, and the larger $\hat\gamma^{(i)}$ is, the more confident the prediction. Define the functional margin of the whole sample set as
$$\hat\gamma = \min\limits_{i=1,\dots,m}\hat\gamma^{(i)}$$

Our objective can then be written as
$$\max\limits_{w, b}{\hat\gamma}$$

However, by scaling $w$ and $b$ up proportionally, $\hat\gamma$ can be made arbitrarily large, so we also need the constraint
$$||w|| = 1$$

But with this constraint the problem is not a convex optimization problem, so we instead define the geometric margin
$$\begin{array}{lcl} \gamma^{(i)} &=& y^{(i)}\left(\left(\frac{w}{||w||}\right)^Tx^{(i)}+\frac{b}{||w||}\right)\\ \gamma &=& \min\limits_{i=1,\dots,m}\gamma^{(i)} \end{array}$$
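A small sketch computing both margins on a toy dataset follows; the data and the choice of `w`, `b` are arbitrary placeholders.

```python
import numpy as np

def functional_margins(w, b, X, y):
    """gamma_hat^(i) = y^(i) (w^T x^(i) + b) for every sample."""
    return y * (X @ w + b)

def geometric_margins(w, b, X, y):
    """gamma^(i) = functional margin scaled by 1/||w||."""
    return functional_margins(w, b, X, y) / np.linalg.norm(w)

# toy linearly separable data (placeholder values)
X = np.array([[2.0, 2.0], [1.0, -1.0], [-2.0, -1.0], [-1.0, 1.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([2.0, 0.0]), 0.0

print(functional_margins(w, b, X, y).min())  # functional margin of the set: 2.0
print(geometric_margins(w, b, X, y).min())   # geometric margin of the set: 1.0
```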

Geometrically, $\gamma^{(i)}$ is the distance from the $i$-th sample to the decision boundary. Since points farther from the boundary are classified with higher confidence, the original problem becomes
$$\max\limits_{w, b}{\gamma}$$

The objective is still not convex, so convex optimization still cannot be applied. However, the norm of $w$ can now be chosen freely without changing the objective value, and there is always a scaling under which $\hat\gamma = 1$ holds. The problem can therefore be restated as
$$\begin{array}{rll} \min\limits_{w, b} & \frac{1}{2}||w||^2 &\\ \text{s.t.} & y^{(i)}(w^Tx^{(i)}+b) \ge 1, & i=1,\dots,m \end{array}$$

Soft Margin Classifier

For a linearly non-separable $S$, restate the problem with $\ell_1$ regularization:
$$\begin{array}{rll} \min\limits_{w, b, \xi} & \frac{1}{2}||w||^2 + C\sum\limits_{i=1}^m\xi_i&\\ \text{s.t.} & y^{(i)}(w^Tx^{(i)}+b) \ge 1 - \xi_i, & i=1,\dots,m\\ & \xi_i \ge 0, & i=1,\dots,m \end{array}$$
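As a sketch only (not part of the original note), this primal can be handed to a generic convex solver; the snippet below assumes the third-party `cvxpy` package is available and uses placeholder data.

```python
import cvxpy as cp
import numpy as np

# placeholder data: m samples in R^n, labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.0, -1.0], [-2.0, -1.0], [-1.0, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape
C = 1.0

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(m)   # slack variables

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]

prob = cp.Problem(objective, constraints)
prob.solve()
print(w.value, b.value)
```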

Construct the generalized Lagrangian
$$L(w,b,\xi,\alpha,\beta) = \frac{1}{2}||w||^2 + C\sum\limits_{i=1}^m\xi_i - \sum\limits_{i=1}^m \alpha_i[y^{(i)}(w^Tx^{(i)}+b)-1+\xi_i] - \sum\limits_{i=1}^m\beta_i\xi_i$$

It is easy to verify that
$$\begin{array}{rcl} w &=& \vec 0\\ b &=& 0\\ \xi_i &=& 2 \end{array}$$

satisfies Slater's condition, so it suffices to solve the dual problem
$$\max\limits_{\alpha,\beta} \theta_D(\alpha, \beta)$$

Taking derivatives of the Lagrangian,
$$\begin{array}{rcl} \nabla_wL &=& w-\sum\limits_{i=1}^m \alpha_iy^{(i)}x^{(i)}\\ \frac{\partial}{\partial b}L &=& -\sum\limits_{i=1}^m \alpha_iy^{(i)}\\ \frac{\partial}{\partial\xi_i}L &=& C-\alpha_i-\beta_i \end{array}$$

we find that at the optimum
$$\begin{array}{rcl} w^* &=& \sum\limits_{i=1}^m\alpha_iy^{(i)}x^{(i)}\\ b^* &=& -\frac{\max\limits_{i: y^{(i)} = -1}w^{*T}x^{(i)} + \min\limits_{i: y^{(i)} = 1}w^{*T}x^{(i)}}{2}\\ w^{*T}x + b^* &=& \sum\limits_{i=1}^m\alpha_iy^{(i)}\langle x^{(i)}, x\rangle + b^* \end{array}$$

Plugging these back in, the dual problem becomes
$$\begin{array}{rll} \max\limits_{\alpha} & \sum\limits_{i=1}^m \alpha_i - \frac{1}{2}\sum\limits_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle\\ \text{s.t.} & 0 \le \alpha_i \le C,\quad i=1,\dots,m\\ & \sum\limits_{i=1}^m \alpha_iy^{(i)} = 0 \end{array}$$

The KKT complementary slackness conditions require
$$\begin{array}{rcl} \alpha_i^*[y^{(i)}(w^{*T}x^{(i)}+b^*)-1+\xi_i^*] &=& 0\\ \beta_i^*\xi_i^* &=& 0 \end{array}$$

which can be summarized as
$$\begin{array}{rcl} \alpha_i^* = 0 &\Rightarrow& \Big(\hat\gamma^{(i)}\Big)^* \ge 1\\ 0 < \alpha_i^* < C &\Rightarrow& \Big(\hat\gamma^{(i)}\Big)^* = 1\\ \alpha_i^* = C &\Rightarrow& \Big(\hat\gamma^{(i)}\Big)^* \le 1 \end{array}$$
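Given a dual solution, the formulas above for $w^*$, $b^*$ and the decision value translate directly into code; below is a minimal sketch in which `alpha` is a hypothetical vector assumed to come from some dual solver rather than an actual optimum.

```python
import numpy as np

def intercept(alpha, X, y):
    """b* = -(max over y=-1 of w*^T x  +  min over y=+1 of w*^T x) / 2."""
    w = (alpha * y) @ X              # w* = sum_i alpha_i y_i x_i
    scores = X @ w
    return -(scores[y == -1].max() + scores[y == 1].min()) / 2

def decision(alpha, b, X, y, x_new):
    """w*^T x + b* = sum_i alpha_i y_i <x_i, x> + b*."""
    return (alpha * y) @ (X @ x_new) + b

# placeholder data and a hypothetical alpha (assumed, not actually optimized)
X = np.array([[2.0, 2.0], [1.0, -1.0], [-2.0, -1.0], [-1.0, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.0, 0.5, 0.0, 0.5])

b = intercept(alpha, X, y)
print(np.sign(decision(alpha, b, X, y, np.array([1.0, 0.0]))))
```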

SMO [1]

To optimize over $(\alpha_1, \alpha_2, \dots, \alpha_m)$ jointly, the most natural idea is coordinate ascent. In this problem, however, the equality constraint
$$\sum\limits_{i=1}^m \alpha_iy^{(i)} = 0$$

can no longer hold after changing the value of a single variable. The SMO algorithm therefore updates two variables at a time, so that the equality constraint remains satisfied:
$$\begin{aligned} &\text{repeat}\ \{\\ &\qquad \text{Select some pair }(\alpha_i,\alpha_j)\text{ to update by heuristic}\\ &\qquad\text{Optimize }W\text{ w.r.t. }(\alpha_i,\alpha_j)\\ &\} \end{aligned}$$

where
$$W = \sum\limits_{i=1}^m \alpha_i - \frac{1}{2}\sum\limits_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle$$

is the objective to be optimized.
Here, step 1 (the selection heuristic) is beyond our scope and we will focus on step 2. Suppose the variables chosen in step 1 are $(\alpha_1,\alpha_2)$, and define the constant
$$\zeta \equiv -\sum\limits_{i=3}^m\alpha_iy^{(i)} = \alpha_1y^{(1)} + \alpha_2y^{(2)}$$

Then the two optimization variables effectively reduce to one:
$$\alpha_1 = (\zeta-\alpha_2y^{(2)})y^{(1)}$$

This turns the original problem into a single-variable optimization:
$$\alpha_2 := \arg\max_{\hat\alpha_2} W((\zeta-\hat\alpha_2y^{(2)})y^{(1)},\hat\alpha_2,\alpha_3,\dots,\alpha_m)$$

If the unconstrained optimum of this variable violates the constraints and cannot be attained, the variable is instead set to the endpoint of its feasible interval closest to that optimum.
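A minimal sketch of that single pair update with clipping, in the spirit of the simplified SMO described in the reference, is given below; `alpha`, `K`, `E`, and `C` are assumed to be supplied by a surrounding SMO loop.

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, E, C):
    """Jointly update (alpha_i, alpha_j) while keeping sum_k alpha_k y_k fixed.

    K is the Gram matrix K[p, q] = <x_p, x_q>, E[p] = f(x_p) - y_p.
    Returns the new (alpha_i, alpha_j); a sketch of one SMO step, not a full solver.
    """
    # feasible interval [L, H] for alpha_j implied by 0 <= alpha <= C
    # and alpha_i y_i + alpha_j y_j = const
    if y[i] != y[j]:
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    else:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2 * K[i, j]   # curvature of W along the pair direction
    if eta <= 0 or L == H:
        return alpha[i], alpha[j]           # skip degenerate pairs in this sketch

    # unconstrained optimum of alpha_j, then clip to the nearest feasible endpoint
    aj = alpha[j] + y[j] * (E[i] - E[j]) / eta
    aj = np.clip(aj, L, H)

    # recover alpha_i from the equality constraint
    ai = alpha[i] + y[i] * y[j] * (alpha[j] - aj)
    return ai, aj
```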

Kernels

When data are mapped from a lower-dimensional space into a higher-dimensional space, originally non-linearly separable data can become linearly separable.

For a function $K: \mathbb{R}^n \times\mathbb{R}^n\rightarrow \mathbb{R}$ and samples $\{x^{(1)},\dots,x^{(m)}\}$, define its corresponding matrix $K = (K_{ij}) \in \mathbb{R}^{m\times m}$ by
$$K_{ij} = K(x^{(i)},x^{(j)})$$
Mercer's theorem states that

there exists some $\phi$ such that $K(x,z) = \langle\phi(x),\phi(z)\rangle$ if and only if the corresponding matrix $K$ is symmetric positive semi-definite for every finite set of samples.
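As a quick numerical illustration (not from the original note), the Mercer condition can be checked on a finite sample by testing whether the kernel matrix is symmetric with non-negative eigenvalues:

```python
import numpy as np

def is_valid_kernel_matrix(K, tol=1e-10):
    """Check symmetry and positive semi-definiteness of a kernel matrix."""
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)   # eigvalsh assumes symmetry
    return symmetric and psd

# the Gram matrix of plain inner products on placeholder data is always PSD
X = np.random.randn(5, 3)
print(is_valid_kernel_matrix(X @ X.T))   # True
```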

A function $K$ of this kind is called a kernel, and the matrix $K$ is called the kernel matrix. Commonly used kernels include the following (a small sketch of both follows the list):

  • Polynomial: $K(x,z) = (x^Tz+c)^d$
  • Gaussian: $K(x,z) = \exp\Big(-\frac{||x-z||^2}{2\sigma^2}\Big)$
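A minimal NumPy sketch of these two kernels; the parameters `c`, `d`, and `sigma` are arbitrary example values.

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    """K(x, z) = (x^T z + c)^d."""
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.dot(x - z, x - z) / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(polynomial_kernel(x, z), gaussian_kernel(x, z))  # 9.0, exp(-1)
```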

Suppose we have an input attribute $x$ but we want to apply the SVM to the input features $\phi(x)$, where $\phi$ is the feature mapping. All we have to do is replace the inner products with the kernel. This is useful because in many cases evaluating $K$ is much more efficient than computing $\langle\phi(x^{(i)}), \phi(x^{(j)})\rangle$ (if $K$ corresponds to an infinite-dimensional feature space, like the Gaussian kernel, there may be no tractable way to compute the inner product of $\phi$ directly).
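For instance, the dual decision value from the previous section becomes kernelized simply by swapping the inner product for $K$; a sketch under the same placeholder assumptions as the earlier ones:

```python
import numpy as np

def kernel_decision(alpha, b, X, y, x_new, kernel):
    """sum_i alpha_i y_i K(x^(i), x_new) + b: the dual decision value
    with every inner product replaced by the kernel."""
    k = np.array([kernel(x_i, x_new) for x_i in X])
    return (alpha * y) @ k + b

# placeholder data and a hypothetical alpha, as in the earlier sketches
X = np.array([[2.0, 2.0], [1.0, -1.0], [-2.0, -1.0], [-1.0, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = np.array([0.0, 0.5, 0.0, 0.5]), 0.0

rbf = lambda u, v: np.exp(-np.dot(u - v, u - v) / 2)   # Gaussian kernel, sigma = 1
print(kernel_decision(alpha, b, X, y, np.array([1.0, 0.0]), rbf))
```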


  1. 【机器学习详解】SMO算法剖析
