I'm trying to do the following simple classification using the LinearSVC object in scikit-learn. I've tried using both version 0.10 and 0.14. Using the code:
from sklearn.svm import LinearSVC, SVC
from numpy import array

data = array([[1007., 1076.],
              [1017., 1009.],
              [2021., 2029.],
              [2060., 2085.]])
groups = array([1, 1, 2, 2])
svc = LinearSVC()
svc.fit(data, groups)
# Predict on the training points themselves
svc.predict(data)
I get the output:
array([2, 2, 2, 2])
However, if I replace the classifier with
svc = SVC(kernel='linear')
then I get the result
array([ 1., 1., 2., 2.])
which is correct. Does anyone know why using LinearSVC would botch this simple problem?
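The algorithm underlying LinearSVC is very sensitive to extreme values in its input, and the features here are in the thousands. Fitting with verbose=1 shows the solver hitting its iteration limit without converging: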
>>> svc = LinearSVC(verbose=1)
>>> svc.fit(data, groups)
optimization finished, #iter = 1000
WARNING: reaching max number of iterations
Using -s 2 may be faster (also see FAQ)
Objective value = -0.001256
nSV = 4
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=1)
(The warning refers to the LibLinear FAQ, since scikit-learn's LinearSVC is based on that library.)
You should normalize the features before fitting, for example by standardizing each column to zero mean and unit variance with scale:
>>> from sklearn.preprocessing import scale
>>> data = scale(data)
>>> svc.fit(data, groups)
optimization finished, #iter = 39
Objective value = -0.240988
nSV = 4
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=1)
>>> svc.predict(data)
array([1, 1, 2, 2])
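If you also want to classify points that were not in the training set, the same scaling has to be applied to them before predicting. Here is a minimal sketch of one way to do that. It assumes a reasonably recent scikit-learn that provides StandardScaler and sklearn.pipeline.make_pipeline (not the 0.10 release mentioned in the question), and random_state=0 is my own choice to make runs repeatable. It also checks that scale is just "subtract the column mean, divide by the column standard deviation":

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, scale
from sklearn.svm import LinearSVC

data = np.array([[1007., 1076.],
                 [1017., 1009.],
                 [2021., 2029.],
                 [2060., 2085.]])
groups = np.array([1, 1, 2, 2])

# scale() standardizes each column: subtract its mean, then divide by its
# population standard deviation (ddof=0, which is NumPy's default).
manual = (data - data.mean(axis=0)) / data.std(axis=0)
same_as_scale = np.allclose(manual, scale(data))

# The pipeline learns the column means and deviations in fit() and reuses
# them in predict(), so new points are transformed the same way.
# random_state=0 is an arbitrary choice for repeatability.
clf = make_pipeline(StandardScaler(), LinearSVC(random_state=0))
clf.fit(data, groups)
predictions = clf.predict(data)

print(same_as_scale, predictions)
```

With the scaler inside the pipeline you can call clf.predict on raw, unscaled feature values and do not have to remember to transform them yourself.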