How to fix the false positives rate of a linear SVM?

盖世英雄少女心 2021-02-20 17:44

I am an SVM newbie and this is my use case: I have a lot of unbalanced data that needs binary classification with a linear SVM. I need to fix the false positive rate at certain values.

2 answers
  • 2021-02-20 18:01

    The predict method for LinearSVC in sklearn looks like this:

    def predict(self, X):
        """Predict class labels for samples in X.
    
        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Samples.
    
        Returns
        -------
        C : array, shape = [n_samples]
            Predicted class label per sample.
        """
        scores = self.decision_function(X)
        if len(scores.shape) == 1:
            indices = (scores > 0).astype(int)
        else:
            indices = scores.argmax(axis=1)
        return self.classes_[indices]
    

    So, in addition to what mbatchkarov suggested, you can change the decisions made by the classifier (any classifier, really) by moving the boundary on the decision function at which it assigns a sample to one class or the other. By default that boundary is 0, as the `scores > 0` line above shows.

    from collections import Counter
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.svm import LinearSVC
    
    data = load_iris()
    
    # keep only the first feature to make the problem harder
    # keep only the first two classes (first 100 samples) for simplicity
    X = data.data[:100, 0:1]
    y = data.target[:100]
    # shuffle data
    indices = np.arange(y.shape[0])
    np.random.shuffle(indices)
    X = X[indices, :]
    y = y[indices]
    
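    # train on the first half, evaluate on the second half
    clf = LinearSVC().fit(X[:50], y[:50])
    scores = clf.decision_function(X[50:])
    
    decision_boundary = 0
    print(Counter((scores > decision_boundary).astype(np.int8).tolist()))
    # e.g. Counter({1: 27, 0: 23}); exact counts vary with the shuffle
    
    decision_boundary = 0.5
    print(Counter((scores > decision_boundary).astype(np.int8).tolist()))
    # e.g. Counter({0: 39, 1: 11})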
    

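    Raising the decision boundary makes the classifier predict class 1 less often, which lowers the false positive rate at the cost of recall. You can tune the boundary to whatever value meets your needs.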
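
    One way to choose the boundary for a target false positive rate is to read it off the ROC curve of a held-out set. The sketch below is only an illustration under assumptions: the data is synthetic (`make_classification`), the helper name `threshold_for_fpr` and the 5% target are made up for the example, and class 1 is treated as the positive class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def threshold_for_fpr(y_true, scores, target_fpr):
    """Hypothetical helper: lowest threshold whose FPR is <= target_fpr."""
    # Keep every candidate threshold so none of the valid ones is skipped.
    fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
    # fpr is non-decreasing and starts at 0, so the last index within the
    # budget gives the most permissive threshold that still satisfies it.
    idx = np.searchsorted(fpr, target_fpr, side="right") - 1
    return thresholds[idx]


# Assumed stand-in for the unbalanced data: about 90% negatives.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

clf = LinearSVC(dual=False).fit(X_train, y_train)
cal_scores = clf.decision_function(X_cal)

target_fpr = 0.05
threshold = threshold_for_fpr(y_cal, cal_scores, target_fpr)

# roc_curve counts a sample as positive when score >= threshold.
y_pred = (cal_scores >= threshold).astype(int)
achieved_fpr = y_pred[y_cal == 0].mean()
print(f"threshold={threshold:.3f}, achieved FPR={achieved_fpr:.3f}")
```

    The rate is only guaranteed on the set used to pick the threshold. On new data it will be close to the target rather than exactly equal, so it is worth checking it on a separate test set.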

  • 2021-02-20 18:11

    The class_weight parameter allows you to push the false positive rate up or down. Let me use an everyday example to illustrate how this works. Suppose you own a night club, and you operate under two constraints:

    1. You want as many people as possible to enter the club (paying customers)
    2. You do not want any underage people in, as this will get you in trouble with the state

    On an average day, (say) only 5% of the people attempting to enter the club will be underage. You are faced with a choice: being lenient or being strict. The former will boost your profits by as much as 5%, but you run the risk of an expensive lawsuit. The latter will inevitably mean that some people who are just above the legal age are denied entry, which costs you money too. You want to adjust the relative cost of leniency vs strictness. Note: you cannot directly control how many underage people enter the club, but you can control how strict your bouncers are.

    Here is a bit of Python that shows what happens as you change the relative importance.

    from collections import Counter
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.svm import LinearSVC
    
    data = load_iris()
    
    # keep only the first feature to make the problem harder
    # keep only the first two classes (first 100 samples) for simplicity
    X = data.data[:100, 0:1]
    y = data.target[:100]
    # shuffle data
    indices = np.arange(y.shape[0])
    np.random.shuffle(indices)
    X = X[indices, :]
    y = y[indices]
    
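    # train on the first half, predict on the second half,
    # giving class 1 a weight i times larger than class 0
    for i in range(1, 20):
        clf = LinearSVC(class_weight={0: 1, 1: i})
        clf = clf.fit(X[:50, :], y[:50])
        print(i, Counter(clf.predict(X[50:]).tolist()))
        # print(clf.decision_function(X[50:]))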
    

    This outputs (exact counts vary with the random shuffle):

    1 Counter({1: 22, 0: 28})
    2 Counter({1: 31, 0: 19})
    3 Counter({1: 39, 0: 11})
    4 Counter({1: 43, 0: 7})
    5 Counter({1: 43, 0: 7})
    6 Counter({1: 44, 0: 6})
    7 Counter({1: 44, 0: 6})
    8 Counter({1: 44, 0: 6})
    9 Counter({1: 47, 0: 3})
    10 Counter({1: 47, 0: 3})
    11 Counter({1: 47, 0: 3})
    12 Counter({1: 47, 0: 3})
    13 Counter({1: 47, 0: 3})
    14 Counter({1: 47, 0: 3})
    15 Counter({1: 47, 0: 3})
    16 Counter({1: 47, 0: 3})
    17 Counter({1: 48, 0: 2})
    18 Counter({1: 48, 0: 2})
    19 Counter({1: 48, 0: 2})
    

    Note how the number of data points classified as 0 decreases as the relative weight of class 1 increases. Assuming you have the computational resources and time to train and evaluate 10 classifiers, you can plot the precision and recall of each one and get a figure like the one below (shamelessly stolen off the internet). You can then use that to decide what the right value of class_weight is for your use case.

    [Figure: Precision-recall tradeoff]
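
    To turn the sweep into numbers you can compare against a false positive budget, you can record the false positive rate, precision and recall for each weight on a held-out set. This is a rough sketch, not a tuned recipe: the synthetic data, the list of weights and the variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Assumed stand-in for the unbalanced data: about 90% negatives.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

results = []
for w in [1, 2, 5, 10, 20]:
    # Weight mistakes on class 1 w times more heavily than on class 0.
    clf = LinearSVC(class_weight={0: 1, 1: w}, dual=False)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_val)

    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred, labels=[0, 1]).ravel()
    fpr = fp / (fp + tn)
    precision = precision_score(y_val, y_pred, zero_division=0)
    recall = recall_score(y_val, y_pred)
    results.append((w, fpr, precision, recall))
    print(f"weight={w:>2}  FPR={fpr:.3f}  "
          f"precision={precision:.3f}  recall={recall:.3f}")
```

    You would then keep the weight whose validation false positive rate is closest to, but not above, the value you need. Because class_weight only steers the rate indirectly, it may not hit an exact target.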
