WEKA-generated models do not seem to predict class and distribution given the attribute index

Question

Overview

I am using the WEKA API 3.7.10 (developer version) to load my pre-built .model files and get predictions from them.

I built 25 models: one for each of five outcome variables with each of five algorithms.

J48 decision tree
Alternating decision tree
Random forest
LogitBoost
Random subspace

I am having problems with J48, random subspace and random forest.

Necessary files

The following is the ARFF representation of my data after creation:

@relation WekaData

@attribute ageDiagNum numeric
@attribute raceGroup {Black,Other,Unknown,White}
@attribute stage3 {0,I,IIA,IIB,IIIA,IIIB,IIIC,IIINOS,IV,'UNK Stage'}
@attribute m3 {M0,M1,MX}
@attribute reasonNoCancerSurg {'Not performed, patient died prior to recommended surgery','Not recommended','Not recommended, contraindicated due to other conditions','Recommended but not performed, patient refused','Recommended but not performed, unknown reason','Recommended, unknown if performed','Surgery performed','Unknown; death certificate or autopsy only case'}
@attribute ext2 {00,05,10,11,13,14,15,16,17,18,20,21,23,24,25,26,27,28,30,31,33,34,35,36,37,38,40,50,60,70,80,85,99}
@attribute time2 {}
@attribute time4 {}
@attribute time6 {}
@attribute time8 {}
@attribute time10 {}

@data
65,White,IIA,MX,'Not recommended, contraindicated due to other conditions',14,?,?,?,?,?

I need to predict the binary attributes time2 to time10 from their respective models.

Below are snippets of the code I use to get the predictions from all the model files:

private static Map<String, Object> predict(Instances instances,
        Classifier classifier, int attributeIndex) {
    Map<String, Object> map = new LinkedHashMap<String, Object>();
    int instanceIndex = 0; // do not change, equal to row 1
    double[] percentage = { 0 };
    double outcomeValue = 0;
    AbstractOutput abstractOutput = null;

    if(classifier.getClass() == RandomForest.class || classifier.getClass() == RandomSubSpace.class) {
        // has problems predicting time2 to time10
        instances.setClassIndex(5); 
    } else {
        // works as intended in LogitBoost and ADTree
        instances.setClassIndex(attributeIndex);    
    }

    try {
        outcomeValue = classifier.classifyInstance(instances.instance(0));
        percentage = classifier.distributionForInstance(instances
                .instance(instanceIndex));
    } catch (Exception e) {
        e.printStackTrace();
    }

    map.put("Class", outcomeValue);

    if (percentage.length > 0) {
        double percentageRaw = 0;
        if (outcomeValue == new Double(1)) {
            percentageRaw = percentage[1];
        } else {
            percentageRaw = 1 - percentage[0];
        }
        map.put("Percentage", percentageRaw);
    } else {
        // J48 returns an empty distribution here, so indexing percentage[i] would throw
        map.put("Percentage", new Double(0));
    }

    return map;
}
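
For reference, the percentage logic above can be separated from the WEKA calls. For a nominal class, classifyInstance normally returns the index of the largest entry of distributionForInstance, so both values can be derived from the same double[]. The following is a minimal sketch under that assumption; DistributionUtil and its methods are hypothetical helpers, not part of the WEKA API, and they only operate on a plain array.

```java
// Hypothetical helper, not part of WEKA: it works on a plain double[] such as
// the array returned by Classifier.distributionForInstance(...).
final class DistributionUtil {

    /** Returns the index of the largest probability, or -1 for an empty array. */
    static int predictedIndex(double[] distribution) {
        int best = -1;
        for (int i = 0; i < distribution.length; i++) {
            if (best < 0 || distribution[i] > distribution[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Returns the probability of the given label, or 0.0 if the index is out of range. */
    static double probabilityOf(double[] distribution, int labelIndex) {
        if (labelIndex < 0 || labelIndex >= distribution.length) {
            return 0.0; // covers the empty-distribution case without throwing
        }
        return distribution[labelIndex];
    }
}
```

With a binary class, probabilityOf(percentage, 1) gives the same value as the if/else above, and it returns 0 instead of throwing when the distribution is empty.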

Here are the models I use to predict the outcome for time2, hence we will use index 6:

instances.setClassIndex(5);

ADTree model for time2 prediction
J48 model for time2 prediction
RandomForest model for time2 prediction
LogitBoost model for time2 prediction
RandomSubSpace model for time2 prediction
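
Note that WEKA class indices are zero-based, so with the ARFF header shown earlier, time2 sits at index 6 and index 5 is ext2. The sketch below only illustrates that counting with a plain list of the attribute names; AttributeIndexDemo is a hypothetical class, and in WEKA itself the same lookup should be possible with instances.attribute("time2").index().

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: mirrors the attribute order of the ARFF header.
final class AttributeIndexDemo {

    // Attribute names in the order they appear in the ARFF header.
    static final List<String> ATTRIBUTES = Arrays.asList(
            "ageDiagNum", "raceGroup", "stage3", "m3", "reasonNoCancerSurg",
            "ext2", "time2", "time4", "time6", "time8", "time10");

    /** Zero-based position of the attribute, or -1 if the name is unknown. */
    static int indexOf(String name) {
        return ATTRIBUTES.indexOf(name);
    }

    public static void main(String[] args) {
        System.out.println("time2 is at index " + indexOf("time2"));
        System.out.println("index 5 is " + ATTRIBUTES.get(5));
    }
}
```

Looking the index up by name avoids hard-coding a number that silently points at a different attribute.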

Problems

As I said before, LogitBoost and ADTree work fine with this straightforward method, unlike the other three. I followed the "Use WEKA in your Java code" tutorial.
[Solved] After some tweaking, RandomForest and RandomSubSpace throw an ArrayIndexOutOfBoundsException when told to predict time2 to time10.
```
java.lang.ArrayIndexOutOfBoundsException: 0
    at weka.classifiers.meta.Bagging.distributionForInstance(Bagging.java:586)
    at weka.classifiers.trees.RandomForest.distributionForInstance(RandomForest.java:602)
    at weka.classifiers.AbstractClassifier.classifyInstance(AbstractClassifier.java:70)
```
The stack trace points to this line as the root of the error:
```
outcomeValue = classifier.classifyInstance(instances.instance(0));
```
Solution: I had made a copy-paste error while creating the ARFF structure for the binary variables time2 to time10, in how the FastVector<String> values were assigned to the FastVector<Attribute> object. All ten of my RandomForest and RandomSubSpace models are working fine now!

[Solved] The J48 decision tree has a new problem now. Instead of not providing any predictions, it now throws an error:

java.lang.ArrayIndexOutOfBoundsException: 11
    at weka.core.DenseInstance.value(DenseInstance.java:332)
    at weka.core.AbstractInstance.isMissing(AbstractInstance.java:315)
    at weka.classifiers.trees.j48.C45Split.whichSubset(C45Split.java:494)
    at weka.classifiers.trees.j48.ClassifierTree.getProbs(ClassifierTree.java:670)
    at weka.classifiers.trees.j48.ClassifierTree.classifyInstance(ClassifierTree.java:231)
    at weka.classifiers.trees.J48.classifyInstance(J48.java:266)

and it traces to the line

outcomeValue = classifier.classifyInstance(instances.instance(0));

Solution: I simply ran the program again with J48 and it worked, returning the outcome variable and the associated distribution.

I hope someone can help me sort out this issue. I really do not know what is wrong with this code: I have checked the Javadocs and examples online, and the predictions still come out constant.

(I am currently checking the main program for the WEKA GUI but please help me out here :-) )

Answer 1:

I've only looked at the RandomForest problem for now. It happens because the Bagging class takes the number of classes from the data instance itself, not from the model. You say in your text that time2 to time10 are binary, but your ARFF file does not: the attributes are declared with an empty value set, so the Bagging class has no way of knowing how many classes there are.

So you just have to declare time2 as binary in your ARFF file, e.g.: @attribute time2 {0,1}

and you won't get the exception any more.
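
An empty nominal declaration like this is easy to miss, so it can help to scan the ARFF header before loading it. The following is a rough sketch, not a full ARFF parser: ArffHeaderCheck is a hypothetical helper, and it assumes unquoted attribute names without spaces, as in the header from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WEKA.
final class ArffHeaderCheck {

    // Matches "@attribute <name> {}" where the braces hold nothing but whitespace.
    // Assumption: attribute names are unquoted and contain no spaces.
    private static final Pattern EMPTY_NOMINAL = Pattern.compile(
            "\\s*@attribute\\s+(\\S+)\\s*\\{\\s*\\}\\s*", Pattern.CASE_INSENSITIVE);

    /** Returns the names of nominal attributes declared with an empty value set. */
    static List<String> emptyNominalAttributes(List<String> headerLines) {
        List<String> names = new ArrayList<>();
        for (String line : headerLines) {
            Matcher m = EMPTY_NOMINAL.matcher(line);
            if (m.matches()) {
                names.add(m.group(1));
            }
        }
        return names;
    }
}
```

Run against the header in the question, this would flag time2, time4, time6, time8 and time10, which are exactly the attributes that need a {0,1} value set.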

I have not looked at the J48 problem, because it may be the same issue with the ARFF definition.

Test code:

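  import java.util.Arrays;

  import weka.classifiers.Classifier;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class J48Test { // wrapper class added so the snippet compiles; the name is arbitrary
    public static void main(String[] argv) {
      try {
        // Load the serialized J48 model.
        Classifier cls = (Classifier) weka.core.SerializationHelper.read("bosom.100k.2.j48.MODEL");
        J48 c = (J48) cls;

        // Load the ARFF data and mark time2 (zero-based index 6) as the class attribute.
        DataSource source = new DataSource("data.arff");
        Instances data = source.getDataSet();
        data.setClassIndex(6);

        // Predict the class and its distribution for the first instance.
        double outcomeValue = c.classifyInstance(data.instance(0));
        System.out.println("outcome " + outcomeValue);
        double[] p = c.distributionForInstance(data.instance(0));
        System.out.println(Arrays.toString(p));
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  }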

Source: https://stackoverflow.com/questions/21808033/weka-generated-models-does-not-seem-to-predict-class-and-distribution-given-the

Tags

java

machine-learning

weka

decision-tree

prediction