Weka - binary classification giving polarized/biased results

Question

Let me say, first up, that I'm a WEKA newbie.

I'm using WEKA for a binary classification problem where certain metrics are being used to get a yes/no answer for the instances.

To exemplify the issue, here's the confusion matrix I got for a set with 288 instances, with 190 'yes' and 98 'no' values using BayesNet:

  a   b   <-- classified as
190   0 |   a = yes
 98   0 |   b = no

This absolute separation is the case with some other classifiers as well, but not with all of them. That said, even if classifiers don't have values polarized to such a degree, they do have a definite bias for the predominant class. For example, here's the result with RandomForest:

  a   b   <-- classified as
164  34 |   a = yes
 62  28 |   b = no

I'm pretty certain I'm missing something very obvious.

Answer 1:

Originally, I thought that BayesNet was the problem, but now I think it is your data.

As was already pointed out in the comments, the likely cause is the unbalanced classes. Most classifiers optimize for accuracy, which in your case is (190 + 0) / 288 = 0.66 for BayesNet and (164 + 28) / 288 = 0.67 for RandomForest. Note that 0.66 is exactly what you get by always predicting 'yes'.
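To make the arithmetic easy to check, here is a minimal Python sketch that recomputes both accuracies from the confusion matrices in the question. The `accuracy` helper and the variable names are my own for illustration and are not part of WEKA.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Rows are the actual classes (yes, no); columns are the predicted classes (yes, no).
# The numbers are copied from the confusion matrices in the question.
bayes_net = [[190, 0], [98, 0]]
random_forest = [[164, 34], [62, 28]]

print(f"BayesNet:     {accuracy(bayes_net):.4f}")      # 0.6597
print(f"RandomForest: {accuracy(random_forest):.4f}")  # 0.6667
```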

As you can see, the difference is not that big, but the solution found by RandomForest is marginally better. It looks "better" because it doesn't put everything in the same class, but I really doubt it is statistically significant.

Like Lars Kotthoff mentioned, it is hard to say. I'd also guess that the features are just not good enough for a better separation.

In addition to trying other classifiers, you should reconsider your performance measure. Accuracy is only a good measure if you have approximately the same number of instances for each class. Otherwise, MCC (Matthews correlation coefficient) or AUC (area under the ROC curve) are good choices (but AUC won't work with LibSVM in WEKA due to incompatible implementations).

The MCC for your examples would be 0 for BayesNet (the denominator is zero because nothing was classified as 'no', and MCC is conventionally set to 0 in that case) and

  ((164*28) - (62*34)) / sqrt((164+62)*(34+28)*(164+34)*(62+28))
= (4592 - 2108) / sqrt(226 * 62 * 198 * 90)
= 2484 / sqrt(249693840)
≈ 0.1572

for RandomForest. So RandomForest shows a slightly better result, but not by much.
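If you want to reproduce the MCC numbers yourself, a small Python sketch along these lines should do it. The `mcc` function is a hypothetical helper of mine, not a WEKA call, and returning 0 when the denominator is zero is the usual convention rather than something the formula itself dictates.

```python
import math

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: follow the common convention of returning 0 when a whole
    # row or column of the matrix is empty and the denominator is zero.
    return numerator / denominator if denominator else 0.0

# 'yes' is treated as the positive class; counts come from the question.
print(mcc(tp=190, fn=0, fp=98, tn=0))              # BayesNet: 0.0
print(round(mcc(tp=164, fn=34, fp=62, tn=28), 4))  # RandomForest: 0.1572
```

An MCC of 0 means the predictions are no better than chance, which matches the intuition that putting every instance into 'yes' tells you nothing about the 'no' class.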

Hard to tell without seeing your data, but they are probably not well separable.

Source: https://stackoverflow.com/questions/15479779/weka-binary-classification-giving-polarized-biased-results

Tags

classification

weka