I want to implement a simple SVM classifier, in the case of high-dimensional binary data (text), for which I think a simple linear SVM is best. The reason for implementing it my
I would like to add a little supplement to the answer about original Platt's work. There is a bit simplified version presented in Stanford Lecture Notes, but derivation of all the formulas should be found somewhere else (e.g. this random notes I found on the Internet).
If it's ok to deviate from original implementations, I can propose you my own variation of the SMO algorithm that follows.
class SVM:
def __init__(self, kernel='linear', C=10000.0, max_iter=100000, degree=3, gamma=1):
self.kernel = {'poly':lambda x,y: np.dot(x, y.T)**degree,
'rbf':lambda x,y:np.exp(-gamma*np.sum((y-x[:,np.newaxis])**2,axis=-1)),
'linear':lambda x,y: np.dot(x, y.T)}[kernel]
self.C = C
self.max_iter = max_iter
def restrict_to_square(self, t, v0, u):
t = (np.clip(v0 + t*u, 0, self.C) - v0)[1]/u[1]
return (np.clip(v0 + t*u, 0, self.C) - v0)[0]/u[0]
def fit(self, X, y):
self.X = X.copy()
self.y = y * 2 - 1
self.lambdas = np.zeros_like(self.y, dtype=float)
self.K = self.kernel(self.X, self.X) * self.y[:,np.newaxis] * self.y
for _ in range(self.max_iter):
for idxM in range(len(self.lambdas)):
idxL = np.random.randint(0, len(self.lambdas))
Q = self.K[[[idxM, idxM], [idxL, idxL]], [[idxM, idxL], [idxM, idxL]]]
v0 = self.lambdas[[idxM, idxL]]
k0 = 1 - np.sum(self.lambdas * self.K[[idxM, idxL]], axis=1)
u = np.array([-self.y[idxL], self.y[idxM]])
t_max = np.dot(k0, u) / (np.dot(np.dot(Q, u), u) + 1E-15)
self.lambdas[[idxM, idxL]] = v0 + u * self.restrict_to_square(t_max, v0, u)
idx, = np.nonzero(self.lambdas > 1E-15)
self.b = np.sum((1.0-np.sum(self.K[idx]*self.lambdas, axis=1))*self.y[idx])/len(idx)
def decision_function(self, X):
return np.sum(self.kernel(X, self.X) * self.y * self.lambdas, axis=1) + self.b
In simple cases it works not much worth than sklearn.svm.SVC, comparison shown below (I have posted code that generates these images on GitHub)
I used quite different approach to derive formulas, you may want to check my preprint on ResearchGate for details.
Have a look at liblinear and for non linear SVM's at libsvm
The following paper "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM" top of page 11 describes the Pegasos algorithm also for kernels.It can be downloaded from http://ttic.uchicago.edu/~nati/Publications/PegasosMPB.pdf
It appears to be a hybrid of coordinate descent and subgradient descent. Also, line 6 of the algorithm is wrong. In the predicate the second appearance of y_i_t should be replaced with y_j instead.
Are you interested in using kernels or not? Without kernels, the best way to solve these kinds of optimization problems is through various forms of stochastic gradient descent. A good version is described in http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf and that has an explicit algorithm.
The explicit algorithm does not work with kernels but can be modified; however, it would be more complex, both in terms of code and runtime complexity.
Some pseudocode for the Sequential Minimal Optimization (SMO) method can be found in this paper by John C. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization. There is also a Java implementation of the SMO algorithm, which is developed for research and educational purpose (SVM-JAVA).
Other commonly used methods to solve the QP optimization problem include:
But be aware that some math knowledge is needed to understand this things (Lagrange multipliers, Karush–Kuhn–Tucker conditions, etc.).