How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python? (p-value, confidence interval)

后端 未结 2 1188
我寻月下人不归
我寻月下人不归 2021-02-04 11:04

I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and

2条回答
  •  长情又很酷
    2021-02-04 11:15

    Bootstrap for 95% confidence interval

    You want to repeat your analysis on multiple resamplings of your data. In the general case, assume you have a function f(x) that determines whatever statistic you need from data x and you can bootstrap like this:

    def bootstrap(x, f, nsamples=1000):
        stats = [f(x[np.random.randint(x.shape[0], size=x.shape[0])]) for _ in range(nsamples)]
        return np.percentile(stats, (2.5, 97.5))
    

    This gives you so-called plug-in estimates of the 95% confidence interval (i.e. you just take the percentiles of the bootstrap distribution).

    In your case, you can write a more specific function like this

    def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
        auc_values = []
        for b in range(nsamples):
            idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
            clf.fit(X_train[idx], y_train[idx])
            pred = clf.predict_proba(X_test)[:, 1]
            roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
            auc_values.append(roc_auc)
        return np.percentile(auc_values, (2.5, 97.5))
    

    Here, clf is the classifier for which you want to test the performance and X_train, y_train, X_test, y_test are like in your code.

    This gives me the following confidence intervals (rounded to three digits, 1000 bootstrap samples):

    • Naive Bayes: 0.986 [0.980 0.988] (estimate, lower and upper limit of confidence interval)
    • Random Forest: 0.983 [0.974 0.989]
    • Multilayer Perceptron: 0.974 [0.223 0.98]

    Permutation tests to test against chance performance

    A permutation test would technically go over all permutations of your observation sequence and evaluate your roc curve with the permuted target values (features are not permuted). This is ok if you have a few observations, but it becomes very costly if you more observations. It is therefore common to subsample the number of permutations and simply do a number of random permutations. Here, the implementation depends a bit more on the specific thing you want to test. The following function does that for your roc_auc values

    def permutation_test(clf, X_train, y_train, X_test, y_test, nsamples=1000):
        idx1 = np.arange(X_train.shape[0])
        idx2 = np.arange(X_test.shape[0])
        auc_values = np.empty(nsamples)
        for b in range(nsamples):
            np.random.shuffle(idx1)  # Shuffles in-place
            np.random.shuffle(idx2)
            clf.fit(X_train, y_train[idx1])
            pred = clf.predict_proba(X_test)[:, 1]
            roc_auc = roc_auc_score(y_test[idx2].ravel(), pred.ravel())
            auc_values[b] = roc_auc
        clf.fit(X_train, y_train)
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
        return roc_auc, np.mean(auc_values >= roc_auc)
    

    This function again takes your classifier as clf and returns the AUC value on the unshuffled data and the p-value (i.e. probability to observe an AUC value larger than or equal to what you have in the unshuffled data).

    Running this with 1000 samples gives p-values of 0 for all three classifiers. Note that these are not exact because of the sampling, but they are an indicating that all of these classifiers perform better than chance.

    Permutation test for differences between classifiers

    This is much easier. Given two classifiers, you have prediction for every observation. You just shuffle the assignment between predictions and classifiers like this

    def permutation_test_between_clfs(y_test, pred_proba_1, pred_proba_2, nsamples=1000):
        auc_differences = []
        auc1 = roc_auc_score(y_test.ravel(), pred_proba_1.ravel())
        auc2 = roc_auc_score(y_test.ravel(), pred_proba_2.ravel())
        observed_difference = auc1 - auc2
        for _ in range(nsamples):
            mask = np.random.randint(2, size=len(pred_proba_1.ravel()))
            p1 = np.where(mask, pred_proba_1.ravel(), pred_proba_2.ravel())
            p2 = np.where(mask, pred_proba_2.ravel(), pred_proba_1.ravel())
            auc1 = roc_auc_score(y_test.ravel(), p1)
            auc2 = roc_auc_score(y_test.ravel(), p2)
            auc_differences(auc1 - auc2)
        return observed_difference, np.mean(auc_differences >= observed_difference)
    

    With this test and 1000 samples, I find no significant differences between the three classifiers:

    • Naive bayes vs random forest: diff=0.0029, p(diff>)=0.311
    • Naive bayes vs MLP: diff=0.0117, p(diff>)=0.186
    • random forest vs MLP: diff=0.0088, p(diff>)=0.203

    Where diff denotes the difference in roc curves between the two classifiers and p(diff>) is the empirical probability to observe a larger difference on a shuffled data set.

提交回复
热议问题