I am an SVM newbie and this is my use case: I have a lot of unbalanced data to be binary classified using a linear SVM. I need to fix the false positives rate at certain values
The predict method for LinearSVC
in sklearn looks like this
def predict(self, X):
"""Predict class labels for samples in X.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Samples.
Returns
-------
C : array, shape = [n_samples]
Predicted class label per sample.
"""
scores = self.decision_function(X)
if len(scores.shape) == 1:
indices = (scores > 0).astype(np.int)
else:
indices = scores.argmax(axis=1)
return self.classes_[indices]
So in addition to what mbatchkarov
suggested you can change the decisions made by the classifier (any classifier really) by changing the boundary at which the classifier says something is of one class or the other.
from collections import Counter
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
data = load_iris()
# remove a feature to make the problem harder
# remove the third class for simplicity
X = data.data[:100, 0:1]
y = data.target[:100]
# shuffle data
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X = X[indices, :]
y = y[indices]
decision_boundary = 0
print Counter((clf.decision_function(X[50:]) > decision_boundary).astype(np.int8))
Counter({1: 27, 0: 23})
decision_boundary = 0.5
print Counter((clf.decision_function(X[50:]) > decision_boundary).astype(np.int8))
Counter({0: 39, 1: 11})
You can optimize the decision boundary to be anything depending on your needs.
The class_weights
parameter allows you to push this false positive rate up or down. Let me use an everyday example to illustrate how this work. Suppose you own a night club, and you operate under two constraints:
On an average day, (say) only 5% percent of the people attempting to enter the club will be underage. You are faced with a choice: being lenient or being strict. The former will boost your profits by as much as 5%, but you are running the risk of an expensive lawsuit. The latter will inevitably mean some people who are just above the legal age will be denied entry, which will cost you money too. You want to adjust the relative cost
of leniency vs strictness. Note: you cannot directly control how many underage people enter the club, but you can control how strict your bouncers are.
Here is a bit of Python that shows what happens as you change the relative importance.
from collections import Counter
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
data = load_iris()
# remove a feature to make the problem harder
# remove the third class for simplicity
X = data.data[:100, 0:1]
y = data.target[:100]
# shuffle data
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X = X[indices, :]
y = y[indices]
for i in range(1, 20):
clf = LinearSVC(class_weight={0: 1, 1: i})
clf = clf.fit(X[:50, :], y[:50])
print i, Counter(clf.predict(X[50:]))
# print clf.decision_function(X[50:])
Which outputs
1 Counter({1: 22, 0: 28})
2 Counter({1: 31, 0: 19})
3 Counter({1: 39, 0: 11})
4 Counter({1: 43, 0: 7})
5 Counter({1: 43, 0: 7})
6 Counter({1: 44, 0: 6})
7 Counter({1: 44, 0: 6})
8 Counter({1: 44, 0: 6})
9 Counter({1: 47, 0: 3})
10 Counter({1: 47, 0: 3})
11 Counter({1: 47, 0: 3})
12 Counter({1: 47, 0: 3})
13 Counter({1: 47, 0: 3})
14 Counter({1: 47, 0: 3})
15 Counter({1: 47, 0: 3})
16 Counter({1: 47, 0: 3})
17 Counter({1: 48, 0: 2})
18 Counter({1: 48, 0: 2})
19 Counter({1: 48, 0: 2})
Note how the number of data points classified as 0
decreases are the relative weight of class 1
increases. Assuming you have the computational resources and time to train and evaluate 10 classifiers, you can plot the precision and recall of each one and get a figure like the one below (shamelessly stolen off the internet). You can then use that to decide what the right value of class_weights
is for your use case.