How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python? (p-value, confidence interval)

Asked 2021-02-04 11:04

I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and compute the p-value to assess statistical significance.

2 Answers
  • 2021-02-04 11:15

    Bootstrap for 95% confidence interval

    You want to repeat your analysis on multiple resamplings of your data. In the general case, assume you have a function f(x) that computes whatever statistic you need from your data x; then you can bootstrap like this:

    import numpy as np

    def bootstrap(x, f, nsamples=1000):
        # Resample rows of x with replacement and evaluate f on each resample
        stats = [f(x[np.random.randint(x.shape[0], size=x.shape[0])]) for _ in range(nsamples)]
        # Plug-in 95% CI: percentiles of the bootstrap distribution
        return np.percentile(stats, (2.5, 97.5))
    

    This gives you so-called plug-in estimates of the 95% confidence interval (i.e. you just take the percentiles of the bootstrap distribution).
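    As a quick illustration of the generic helper (a sketch using a synthetic NumPy sample as a stand-in for real data), here is how you would bootstrap a 95% CI for a sample mean:

```python
import numpy as np

def bootstrap(x, f, nsamples=1000):
    # Resample rows of x with replacement, evaluate f on each resample,
    # and return the 2.5th and 97.5th percentiles (plug-in 95% CI)
    stats = [f(x[np.random.randint(x.shape[0], size=x.shape[0])])
             for _ in range(nsamples)]
    return np.percentile(stats, (2.5, 97.5))

np.random.seed(0)
x = np.random.normal(loc=5.0, scale=1.0, size=500)  # synthetic sample
lo, hi = bootstrap(x, np.mean)
print(f"95% CI for the mean: [{lo:.3f}, {hi:.3f}]")
```

    With 500 observations the interval should be tight around the true mean of 5.0; the same pattern works for any statistic f, including an AUC computed from resampled rows.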

    In your case, you can write a more specific function like this

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
        auc_values = []
        for b in range(nsamples):
            # Resample the training set with replacement
            idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
            clf.fit(X_train[idx], y_train[idx])
            pred = clf.predict_proba(X_test)[:, 1]
            roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
            auc_values.append(roc_auc)
        return np.percentile(auc_values, (2.5, 97.5))
    

    Here, clf is the classifier for which you want to test the performance and X_train, y_train, X_test, y_test are like in your code.
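    For a self-contained illustration, here is an end-to-end run; note that make_classification and LogisticRegression are stand-ins of this sketch, not the asker's actual data and classifiers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    auc_values = []
    for _ in range(nsamples):
        # Refit on a bootstrap resample of the training set,
        # then score on the fixed test set
        idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
        clf.fit(X_train[idx], y_train[idx])
        pred = clf.predict_proba(X_test)[:, 1]
        auc_values.append(roc_auc_score(y_test, pred))
    return np.percentile(auc_values, (2.5, 97.5))

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lo, hi = bootstrap_auc(LogisticRegression(max_iter=1000), X_train, y_train,
                       X_test, y_test, nsamples=200)
print(f"95% CI for AUC: [{lo:.3f}, {hi:.3f}]")
```

    Because the classifier is refit on every resample, this captures variability from both training and evaluation; reducing nsamples trades precision of the interval for speed.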

    This gives me the following confidence intervals (rounded to three digits, 1000 bootstrap samples):

    • Naive Bayes: 0.986 [0.980 0.988] (estimate, lower and upper limit of confidence interval)
    • Random Forest: 0.983 [0.974 0.989]
    • Multilayer Perceptron: 0.974 [0.223 0.98]

    Permutation tests to test against chance performance

    A permutation test would technically go over all permutations of your observation sequence and evaluate your ROC curve with the permuted target values (the features are not permuted). This is fine if you have few observations, but it becomes very costly if you have more. It is therefore common to subsample the number of permutations and simply perform a number of random permutations. Here, the implementation depends a bit more on the specific thing you want to test. The following function does that for your roc_auc values:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def permutation_test(clf, X_train, y_train, X_test, y_test, nsamples=1000):
        idx1 = np.arange(X_train.shape[0])
        idx2 = np.arange(X_test.shape[0])
        auc_values = np.empty(nsamples)
        for b in range(nsamples):
            np.random.shuffle(idx1)  # Shuffles in-place
            np.random.shuffle(idx2)
            clf.fit(X_train, y_train[idx1])  # Train on permuted labels
            pred = clf.predict_proba(X_test)[:, 1]
            roc_auc = roc_auc_score(y_test[idx2].ravel(), pred.ravel())
            auc_values[b] = roc_auc
        clf.fit(X_train, y_train)  # Refit on the unshuffled labels
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
        # p-value: fraction of permuted AUCs at least as large as the observed one
        return roc_auc, np.mean(auc_values >= roc_auc)
    

    This function again takes your classifier as clf and returns the AUC value on the unshuffled data together with the p-value (i.e., the probability of observing an AUC value greater than or equal to the one obtained on the unshuffled data).

    Running this with 1000 samples gives p-values of 0 for all three classifiers. Note that these are not exact because of the sampling, but they indicate that all of these classifiers perform better than chance.
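    As a side note, a reported p-value of exactly zero is an artifact of subsampling the permutations: the smallest p-value a test with nsamples random permutations can justify is 1/(nsamples + 1). A common correction (an optional refinement, not part of the function above) counts the observed statistic as one extra permutation:

```python
import numpy as np

def permutation_pvalue(observed, permuted):
    # Add 1 to numerator and denominator so the p-value can never be
    # exactly zero: the observed statistic counts as one permutation.
    permuted = np.asarray(permuted)
    return (1 + np.sum(permuted >= observed)) / (1 + len(permuted))

# With 999 permuted AUCs, all below the observed value of 0.98,
# the smallest reportable p-value is 1/1000 rather than 0.
p = permutation_pvalue(0.98, np.random.uniform(0.4, 0.6, size=999))
print(p)  # 0.001
```

    With this correction, "p = 0" in the results above would be reported as "p < 0.001", which is the honest statement given 1000 permutations.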

    Permutation test for differences between classifiers

    This is much easier. Given two classifiers, you have a prediction for every observation. You just shuffle the assignment between predictions and classifiers, like this:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def permutation_test_between_clfs(y_test, pred_proba_1, pred_proba_2, nsamples=1000):
        auc_differences = []
        auc1 = roc_auc_score(y_test.ravel(), pred_proba_1.ravel())
        auc2 = roc_auc_score(y_test.ravel(), pred_proba_2.ravel())
        observed_difference = auc1 - auc2
        for _ in range(nsamples):
            # Randomly swap the two classifiers' predictions per observation
            mask = np.random.randint(2, size=len(pred_proba_1.ravel()))
            p1 = np.where(mask, pred_proba_1.ravel(), pred_proba_2.ravel())
            p2 = np.where(mask, pred_proba_2.ravel(), pred_proba_1.ravel())
            auc1 = roc_auc_score(y_test.ravel(), p1)
            auc2 = roc_auc_score(y_test.ravel(), p2)
            auc_differences.append(auc1 - auc2)
        # p-value: fraction of permuted differences at least as large as the observed one
        return observed_difference, np.mean(np.array(auc_differences) >= observed_difference)
    

    With this test and 1000 samples, I find no significant differences between the three classifiers:

    • Naive bayes vs random forest: diff=0.0029, p(diff>)=0.311
    • Naive bayes vs MLP: diff=0.0117, p(diff>)=0.186
    • random forest vs MLP: diff=0.0088, p(diff>)=0.203

    Here, diff denotes the difference in ROC AUC between the two classifiers, and p(diff>) is the empirical probability of observing a larger difference on a shuffled data set.
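    Put together, a self-contained run might look like this; LogisticRegression and GaussianNB on make_classification data are hypothetical stand-ins of this sketch for the original models and data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def permutation_test_between_clfs(y_test, pred_proba_1, pred_proba_2, nsamples=1000):
    auc1 = roc_auc_score(y_test, pred_proba_1)
    auc2 = roc_auc_score(y_test, pred_proba_2)
    observed_difference = auc1 - auc2
    auc_differences = np.empty(nsamples)
    for b in range(nsamples):
        # Randomly swap the two models' predictions per observation
        mask = np.random.randint(2, size=len(pred_proba_1))
        p1 = np.where(mask, pred_proba_1, pred_proba_2)
        p2 = np.where(mask, pred_proba_2, pred_proba_1)
        auc_differences[b] = roc_auc_score(y_test, p1) - roc_auc_score(y_test, p2)
    return observed_difference, np.mean(auc_differences >= observed_difference)

X, y = make_classification(n_samples=600, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
proba_lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
proba_nb = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
diff, pval = permutation_test_between_clfs(y_te, proba_lr, proba_nb)
print(f"diff={diff:.4f}, p(diff>)={pval:.3f}")
```

    Note that this is a one-sided test (probability of a difference at least as large as the observed one); for a two-sided test, compare absolute differences instead.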

  • 2021-02-04 11:15

    One can use the code given below to compute the AUC and an asymptotic, normally distributed confidence interval for neural nets. Note that tf.contrib only exists in TensorFlow 1.x; it was removed in TensorFlow 2.

    tf.contrib.metrics.auc_with_confidence_intervals(
        labels,
        predictions,
        weights=None,
        alpha=0.95,
        logit_transformation=True,
        metrics_collections=(),
        updates_collections=(),
        name=None)
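    Since tf.contrib is gone in TensorFlow 2.x, a framework-free alternative is to compute the asymptotic normal CI by hand. Below is a minimal sketch using the Hanley and McNeil (1982) standard-error approximation; this approximation is simpler but less accurate than DeLong's method, which is the usual choice for exact asymptotic AUC intervals:

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def auc_ci_hanley_mcneil(y_true, y_score, alpha=0.95):
    # Asymptotic normal CI for the ROC AUC using the Hanley-McNeil (1982)
    # standard-error approximation, clipped to [0, 1].
    y_true = np.asarray(y_true)
    auc = roc_auc_score(y_true, y_score)
    n1 = np.sum(y_true == 1)  # number of positives
    n2 = np.sum(y_true == 0)  # number of negatives
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    se = np.sqrt((auc * (1 - auc) + (n1 - 1) * (q1 - auc ** 2)
                  + (n2 - 1) * (q2 - auc ** 2)) / (n1 * n2))
    z = norm.ppf(0.5 + alpha / 2)  # 1.96 for a 95% interval
    return auc, max(0.0, auc - z * se), min(1.0, auc + z * se)

rng = np.random.RandomState(0)
y = rng.randint(2, size=400)
scores = y + rng.normal(scale=0.8, size=400)  # noisy but informative scores
auc, lo, hi = auc_ci_hanley_mcneil(y, scores)
print(f"AUC={auc:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

    Unlike the bootstrap in the first answer, this treats the fitted model as fixed and only quantifies uncertainty from the finite test set.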
    