Support vector machines are arguably among the best supervised learning algorithms. For a linear binary classification problem, let $y \in \{-1,1\}$ and $x \in \mathbb{R}^n$. Note that we do not use the augmented feature vector here; instead the hypothesis is parameterized by $w$ and $b$:

$$h_{w,b}(x) = g(w^Tx+b)$$
where

$$g(z) = \left\{\begin{array}{cl}
1 & \text{if }z \ge 0\\
-1 & \text{otherwise}
\end{array}\right.$$
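A minimal numpy sketch of this hypothesis (the values of $w$, $b$, and the inputs below are made up for illustration):

```python
import numpy as np

def predict(w, b, x):
    """h_{w,b}(x): return +1 if w.x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([1.0, -2.0])   # hypothetical parameters
b = 0.5
print(predict(w, b, np.array([3.0, 1.0])))   # w.x + b = 1.5  -> +1
print(predict(w, b, np.array([0.0, 1.0])))   # w.x + b = -1.5 -> -1
```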
Optimal Margin Classifier
Assume $S$ is linearly separable, i.e. there exist $w, b$ such that every sample is classified correctly. The magnitude of $w^Tx+b$ can be viewed as the confidence of an individual prediction. Define the functional margin

$$\hat\gamma^{(i)} = y^{(i)}(w^Tx^{(i)}+b)$$

If $\hat\gamma^{(i)} > 0$, the prediction is correct, and the larger $\hat\gamma^{(i)}$ is, the more confident the prediction. Define the functional margin of the training set as

$$\hat\gamma = \min\limits_{i=1,\dots,m}\hat\gamma^{(i)}$$
Our goal can then be stated as

$$\max\limits_{w, b}{\hat\gamma}$$

However, if $(w, b)$ is scaled up by a common factor, $\hat\gamma$ can be made arbitrarily large, so we would also need the constraint

$$||w|| = 1$$
But with this constraint the problem is no longer a convex optimization problem, so we define the geometric margin instead:

$$\begin{array}{lcl}
\gamma^{(i)} &=& y^{(i)}\left(\left(\frac{w}{||w||}\right)^Tx^{(i)}+\frac{b}{||w||}\right)\\
\gamma &=& \min\limits_{i=1,\dots,m}\gamma^{(i)}
\end{array}$$
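Both margins are easy to compute directly; here is a small numpy sketch on hypothetical data (the dataset and parameters are invented for illustration):

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 0.0], [-1.0, -1.0]])  # toy samples
y = np.array([1, 1, -1])
w = np.array([1.0, 1.0])                               # hypothetical classifier
b = -1.0

functional = y * (X @ w + b)                  # \hat\gamma^{(i)}
geometric = functional / np.linalg.norm(w)    # \gamma^{(i)}

print("functional margin of the set:", functional.min())
print("geometric margin of the set: ", geometric.min())
```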
Geometrically, $\gamma^{(i)}$ is the distance from the $i$-th sample to the decision boundary. Since points farther from the boundary are classified with higher confidence, the problem becomes

$$\max\limits_{w, b}{\gamma}$$
The objective is still not convex, so convex optimization still does not apply directly. However, the length of $w$ can now be scaled freely without changing the objective value, and there is always a scaling for which $\hat\gamma = 1$. Fixing $\hat\gamma = 1$, maximizing $\gamma = \hat\gamma/||w||$ is the same as minimizing $||w||$, so the problem can be restated as

$$\begin{array}{rll}
\min\limits_{w, b} & \frac{1}{2}||w||^2 &\\
\text{s.t.} & y^{(i)}(w^Tx^{(i)}+b) \ge 1, & i=1,\dots,m
\end{array}$$
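This is a standard quadratic program, so as a sanity check it can be handed to a generic convex solver. A minimal sketch using cvxpy on an assumed separable toy dataset (the data and names below are mine, not from the notes):

```python
import cvxpy as cp
import numpy as np

# Hard-margin primal on a tiny separable toy set (hypothetical data).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

w = cp.Variable(n)
b = cp.Variable()

# minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
prob = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w)),
    [cp.multiply(y, X @ w + b) >= 1],
)
prob.solve()
print("w* =", w.value, " b* =", b.value)
```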
Soft Margin Classifier
For a linearly non-separable $S$, the problem is restated with $\ell_1$ regularization:

$$\begin{array}{rll}
\min\limits_{w, b, \xi} & \frac{1}{2}||w||^2 + C\sum\limits_{i=1}^m\xi_i&\\
\text{s.t.} & y^{(i)}(w^Tx^{(i)}+b) \ge 1 - \xi_i, & i=1,\dots,m\\
& \xi_i \ge 0, & i=1,\dots,m
\end{array}$$
Construct the generalized Lagrangian

$$L(w,b,\xi,\alpha,\beta) = \frac{1}{2}||w||^2 + C\sum\limits_{i=1}^m\xi_i- \sum\limits_{i=1}^m \alpha_i\left[y^{(i)}(w^Tx^{(i)}+b)-1+\xi_i\right] - \sum\limits_{i=1}^m\beta_i\xi_i$$
It is easy to verify that

$$\begin{array}{rcl}
w &=& \vec 0\\
b &=& 0\\
\xi_i &=& 2
\end{array}$$
satisfies Slater's condition (every inequality constraint holds strictly: $y^{(i)}(0^Tx^{(i)}+0) = 0 > 1 - 2$ and $\xi_i = 2 > 0$), so it suffices to solve the dual problem

$$\max\limits_{\alpha,\,\beta \ge 0} \theta_D(\alpha, \beta)$$

Differentiating the Lagrangian,

$$\begin{array}{rcl}
\nabla_wL &=& w-\sum\limits_{i=1}^m \alpha_iy^{(i)}x^{(i)}\\
\frac{\partial}{\partial b}L &=& -\sum\limits_{i=1}^m \alpha_iy^{(i)}\\
\frac{\partial}{\partial\xi_i}L &=& C-\alpha_i-\beta_i
\end{array}$$
Setting these to zero, we find that at the optimum

$$\begin{array}{rcl}
w^* &=& \sum\limits_{i=1}^m\alpha_iy^{(i)}x^{(i)}\\
b^* &=& -\frac{\max\limits_{i: y^{(i)} = -1}w^{*T}x^{(i)} + \min\limits_{i: y^{(i)} = 1}w^{*T}x^{(i)}}{2}\\
w^{*T}x + b^* &=& \sum\limits_{i=1}^m\alpha_iy^{(i)}\langle x^{(i)}, x\rangle + b^*
\end{array}$$
Plugging these back into the Lagrangian, the dual problem becomes

$$\begin{array}{rll}
\max\limits_{\alpha} & \sum\limits_{i=1}^m \alpha_i - \frac{1}{2}\sum\limits_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle\\
\text{s.t.} & 0 \le \alpha_i \le C,\quad i=1,\dots,m\\
& \sum\limits_{i=1}^m \alpha_iy^{(i)} = 0
\end{array}$$
The KKT complementary slackness conditions require

$$\begin{array}{rcl}
\alpha_i^*\left[y^{(i)}(w^{*T}x^{(i)}+b^*)-1+\xi_i^*\right] &=& 0\\
\beta_i^*\xi_i^* &=& 0
\end{array}$$
Summarizing these conditions:

$$\begin{array}{rcl}
\alpha_i^* = 0 &\Rightarrow& \big(\hat\gamma^{(i)}\big)^* \ge 1\\
0 < \alpha_i^* < C &\Rightarrow& \big(\hat\gamma^{(i)}\big)^* = 1\\
\alpha_i^* = C &\Rightarrow& \big(\hat\gamma^{(i)}\big)^* \le 1
\end{array}$$
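To make the derivation concrete, here is a hedged cvxpy sketch that solves this soft-margin dual on toy data and recovers $w^*$ and $b^*$ from the formulas above; the dataset, the jitter term, and all variable names are my own illustrative choices:

```python
import cvxpy as cp
import numpy as np

# Soft-margin dual on toy data (hypothetical); linear kernel <x_i, x_j>.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0], [0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)
C = 1.0

K = X @ X.T                                   # Gram matrix of inner products
Q = np.outer(y, y) * K + 1e-8 * np.eye(m)     # tiny jitter keeps Q numerically PSD

alpha = cp.Variable(m)
prob = cp.Problem(
    cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q)),
    [alpha >= 0, alpha <= C, y @ alpha == 0],
)
prob.solve()

a = alpha.value
w_star = (a * y) @ X                          # w* = sum_i alpha_i y_i x_i
# b* from the midpoint formula derived above
b_star = -(np.max(X[y == -1] @ w_star) + np.min(X[y == 1] @ w_star)) / 2
print("alpha* =", np.round(a, 3))
print("w* =", w_star, " b* =", b_star)
```

Only the $\alpha_i$ strictly between $0$ and $C$ correspond to points lying exactly on the margin, consistent with the KKT summary above.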
SMO
To optimize over all of $(\alpha_1, \alpha_2, \dots, \alpha_m)$, the most straightforward idea is coordinate ascent. In this problem, however, the equality constraint

$$\sum\limits_{i=1}^m \alpha_iy^{(i)} = 0$$
cannot stay satisfied once a single variable is changed. The SMO algorithm therefore updates two variables at a time, so that the equality constraint is preserved:

$$\begin{aligned}
&\text{repeat}\ \{\\
&\qquad \text{Select some pair }(\alpha_i,\alpha_j)\text{ to update by heuristic}\\
&\qquad\text{Optimize }W\text{ w.r.t. }(\alpha_i,\alpha_j)\\
&\}
\end{aligned}$$
where

$$W = \sum\limits_{i=1}^m \alpha_i - \frac{1}{2}\sum\limits_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle$$

is the objective being maximized.
Here, step 1 is beyond our scope and we will focus on step 2. Suppose the pair selected in step 1 is $(\alpha_1,\alpha_2)$, and define the constant

$$\zeta \equiv -\sum\limits_{i=3}^m\alpha_iy^{(i)} = \alpha_1y^{(1)} + \alpha_2y^{(2)}$$
Then the two optimization variables effectively reduce to one, since

$$\alpha_1 = (\zeta-\alpha_2y^{(2)})y^{(1)}$$
This turns the subproblem into a single-variable optimization:

$$\alpha_2 := \arg\max_{\hat\alpha_2} W\big((\zeta-\hat\alpha_2y^{(2)})y^{(1)},\hat\alpha_2,\alpha_3,\dots,\alpha_m\big)$$
If the unconstrained optimum of this variable violates the box constraints and cannot be attained, we instead take the endpoint of its feasible interval closest to that optimum as the new value, as in the sketch below.
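Below is a simplified, hedged sketch of one such pair update (following the clipping rule just described); the pair-selection heuristic and the bias update are omitted, and the closed-form step via the errors $E_i, E_j$ is the standard simplified-SMO formula rather than anything derived above. All names are my own:

```python
import numpy as np

def smo_pair_update(i, j, alpha, b, Kmat, y, C):
    """One SMO update of the pair (alpha_i, alpha_j); Kmat is the kernel matrix."""
    f = (alpha * y) @ Kmat + b               # current outputs f(x) on the training set
    E_i, E_j = f[i] - y[i], f[j] - y[j]      # prediction errors

    # feasible segment for alpha_j implied by alpha_i y_i + alpha_j y_j = zeta
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = 2 * Kmat[i, j] - Kmat[i, i] - Kmat[j, j]   # curvature along the segment
    if eta >= 0 or L == H:
        return alpha, b                      # skip degenerate pairs in this sketch

    alpha_j_new = np.clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H)
    alpha_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - alpha_j_new)

    alpha = alpha.copy()
    alpha[i], alpha[j] = alpha_i_new, alpha_j_new
    return alpha, b
```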
Kernels
Mapping from a lower-dimensional space into a higher-dimensional space can make originally non-linearly-separable data linearly separable. For instance, one-dimensional data that no single threshold on $x$ can separate may become separable by a line after the mapping $x \mapsto (x, x^2)$.
For a function $K: \mathbb{R}^n \times\mathbb{R}^n\rightarrow \mathbb{R}$, define its corresponding matrix $K = (K_{ij}) \in \mathbb{R}^{m\times m}$ over the training samples by

$$K_{ij} = K(x^{(i)},x^{(j)})$$

Mercer's theorem shows that
there exists some $\phi$ such that $K(x,z) = \langle\phi(x),\phi(z)\rangle$ if and only if the matrix $K$ is symmetric positive semi-definite for every finite set of points $\{x^{(1)},\dots,x^{(m)}\}$.
In this case the function $K$ is called a kernel and the matrix $K$ is called the kernel matrix. Commonly used kernels include:
- Polynomial: $K(x,z) = (x^Tz+c)^d$
- Gaussian: $K(x,z) = \exp\Big(-\frac{||x-z||^2}{2\sigma^2}\Big)$
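As a quick illustration, here is a small numpy sketch of these two kernels together with a check that their kernel matrices are symmetric positive semi-definite, as Mercer's theorem requires (data and hyperparameters are arbitrary):

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 2))   # toy points

for kernel in (poly_kernel, gaussian_kernel):
    # kernel matrix K_ij = K(x^(i), x^(j))
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Mercer: K should be symmetric positive semi-definite
    print(kernel.__name__, np.allclose(K, K.T),
          np.linalg.eigvalsh(K).min() >= -1e-8)
```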
Suppose we have an input attribute $x$ but we want to run the SVM over the input features $\phi(x)$, where $\phi$ is the feature mapping. All we have to do is replace every inner product with the kernel. This is useful because in many cases computing $K(x^{(i)}, x^{(j)})$ is much more efficient than computing $\langle\phi(x^{(i)}), \phi(x^{(j)})\rangle$ directly (and if $K$ corresponds to an infinite-dimensional feature space, as the Gaussian kernel does, there may be no easy way to work out the inner product of $\phi$ at all).
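In practice this substitution is usually a one-line switch in an SVM library; for example, a hedged scikit-learn sketch on synthetic data (the dataset and hyperparameters below are purely illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, non-linearly separable data: label +1 inside the unit circle, -1 outside.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)   # Gaussian kernel

print("linear kernel accuracy:", linear.score(X, y))
print("Gaussian (rbf) kernel accuracy:", rbf.score(X, y))
```

On data like this, the Gaussian kernel is expected to separate the classes much better than the linear one, even though the underlying optimization problem is unchanged apart from the inner products.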