Question
I'm trying to classify an unlabelled string using Weka. I'm not an expert in data mining, so I have been struggling with the terminology. I provide the training data, set the unlabelled string, and run the M5Rules classifier. I do get an output, but I have no idea what it means:
run:
{17 1,35 1,64 1,135 1,205 1,214 1,215 1,284 1,288 1,309 1,343 1,461 1,493 1,500 1,552 1,806 -0.038168} | -0.03816793850062397
-0.03816793850062397 ->
Results
======
Correlation coefficient 0
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 1
BUILD SUCCESSFUL (total time: 1 second)
The source code is as follows:
public Categorizer(){
try{
// *** READ ARFF FILES ***
//BufferedReader trainReader = new BufferedReader(new FileReader("c:/Users/Yehia A.Salam/Desktop/dd/training-data.arff"));//File with text examples
//BufferedReader classifyReader = new BufferedReader(new FileReader("c:/Users/Yehia A.Salam/Desktop/dd/test-data.arff"));//File with text to classify
// Create training data instances
TextDirectoryLoader loader = new TextDirectoryLoader();
loader.setDirectory(new File("c:/Users/Yehia A.Salam/Desktop/dd/training-data"));
Instances dataRaw = loader.getDataSet();
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(dataRaw);
Instances dataTraining = Filter.useFilter(dataRaw, filter);
dataTraining.setClassIndex(dataRaw.numAttributes() - 1);
// Create test data instances
loader.setDirectory(new File("c:/Users/Yehia A.Salam/Desktop/dd/test-data"));
dataRaw = loader.getDataSet();
Instances dataTest = Filter.useFilter(dataRaw, filter);
dataTest.setClassIndex(dataTest.numAttributes() - 1);
// Classify
FilteredClassifier model = new FilteredClassifier();
model.setFilter(new StringToWordVector());
model.setClassifier(new M5Rules());
model.buildClassifier(dataTraining);
for (int i = 0; i < dataTest.numInstances(); i++) {
dataTest.instance(i).setClassMissing();
double cls = model.classifyInstance(dataTest.instance(i));
dataTest.instance(i).setClassValue(cls);
System.out.println(dataTest.instance(i).toString() + " | " + cls);
System.out.println(cls + " -> " + dataTest.instance(i).classAttribute().value((int) cls));
// evaluate classifier and print some statistics
Evaluation eval = new Evaluation(dataTraining);
eval.evaluateModelOnce(cls, dataTest.instance(i));
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
}
}
catch(FileNotFoundException e){
System.err.println(e.getMessage());
}
catch(IOException i){
System.err.println(i.getMessage());
}
catch(Exception o){
System.err.println(o.getMessage());
}
}
And finally, a couple of screenshots in case I got anything wrong in the folder hierarchy (images not included here):
Answer 1:
tl;dr:
- You set the class index to a random feature
- You have to use a classifier, not a regression algorithm
The problem is how you initialize your data sets. Although Weka usually puts the class in the last column, the TextDirectoryLoader doesn't. In fact, you don't need to set the class index manually because it is already set, so remove the lines
dataTraining.setClassIndex(dataRaw.numAttributes() - 1);
dataTest.setClassIndex(dataTest.numAttributes() - 1);
(The first line is wrong anyway, because you use the number of attributes from the raw data set, but choose the column of the already filtered data set.)
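The index mix-up can be illustrated without Weka. The following is a minimal plain-Java sketch, not Weka code: the attribute names and their order are hypothetical stand-ins for the raw and filtered data sets, and `attributeAtLastIndexOf` is a helper invented for this illustration. It only shows that an index computed as "size minus one" need not land on the class column. To see the real layout, print the header of your own filtered data set.

```java
import java.util.Arrays;
import java.util.List;

public class ClassIndexDemo {

    /** Returns the attribute of target selected by the index (source size - 1). */
    static String attributeAtLastIndexOf(List<String> source, List<String> target) {
        int index = source.size() - 1;
        return target.get(index);
    }

    public static void main(String[] args) {
        // Hypothetical layouts; names and order are made up for illustration.
        List<String> raw = Arrays.asList("text", "class");
        List<String> filtered = Arrays.asList("class", "word_a", "word_b", "word_c");

        // Index taken from the raw data but applied to the filtered data.
        System.out.println(attributeAtLastIndexOf(raw, filtered));
        // Index taken from the filtered data itself.
        System.out.println(attributeAtLastIndexOf(filtered, filtered));
    }
}
```

In this made-up layout neither call selects the class column, which is why relying on the class index that is already set is the safer choice.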
If you then run your code, you will get this:
weka.classifiers.functions.LinearRegression: Cannot handle binary class!
As I already guessed, M5Rules is not a classifier but a regression algorithm. If you use a classifier like J48 or RandomForest, you will get a more sensible output. Just change the line
model.setClassifier(new M5Rules());
to
model.setClassifier(new RandomForest());
As for your output, here is what I make of it:
{17 1,35 1,64 1,135 1,205 1,214 1,215 1,284 1,288 1,309 1,343 1,461 1,493 1,500 1,552 1,806 -0.038168} | -0.03816793850062397
-0.03816793850062397 ->
is the result of the lines
System.out.println(dataTest.instance(i).toString() + " | " + cls);
System.out.println(cls + " -> " + dataTest.instance(i).classAttribute().value((int) cls));
So you see the features of your instance serialized as sparse ARFF, followed by | and the class.
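To make the sparse notation easier to read, here is a small plain-Java sketch that splits such a line into index/value pairs. It is not part of Weka: `parseSparse` is a hypothetical helper, and it assumes every value is numeric, which holds for the line printed above but not for sparse ARFF in general.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SparseInstanceDemo {

    /** Parses a line such as "{17 1,806 -0.038168}" into attribute index -> value. */
    static Map<Integer, Double> parseSparse(String line) {
        Map<Integer, Double> values = new LinkedHashMap<>();
        String body = line.trim();
        // Strip the surrounding braces.
        body = body.substring(1, body.length() - 1).trim();
        if (body.isEmpty()) {
            return values;
        }
        for (String pair : body.split(",")) {
            // Each pair is "<attribute index> <value>"; unlisted attributes are zero.
            String[] parts = pair.trim().split("\\s+");
            values.put(Integer.parseInt(parts[0]), Double.parseDouble(parts[1]));
        }
        return values;
    }

    public static void main(String[] args) {
        Map<Integer, Double> values = parseSparse("{17 1,35 1,64 1,806 -0.038168}");
        for (Map.Entry<Integer, Double> entry : values.entrySet()) {
            System.out.println("attribute " + entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Read this way, attributes 17, 35, 64 and so on are word features with value 1, and the last listed attribute, 806, holds the value that the loop wrote back with setClassValue.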
Usually, the class should be an integer, but from the documentation of M5Rules I gather that it is meant for regression problems, so you won't get discrete classes but continuous values, in your case -0.03816793850062397.
Since you (incorrectly) set a numerical feature as the class label, M5Rules didn't complain and gave you an output. If you use an actual classifier, you will get your labels "health" or "travel".
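The difference between the two kinds of output can be sketched in plain Java. This is a stand-in for illustration, not Weka's implementation: `labelFor` is a hypothetical helper, and the order of the labels "health" and "travel" is assumed. The idea is that a classifier's prediction is a whole number that indexes into the list of class labels, whereas a regression output such as -0.03816793850062397 does not correspond to any label.

```java
import java.util.Arrays;
import java.util.List;

public class LabelLookupDemo {

    /** Treats the prediction as an index into labels; returns "" if it is not a valid index. */
    static String labelFor(List<String> labels, double prediction) {
        int index = (int) prediction; // truncates toward zero
        boolean wholeNumber = prediction == index;
        if (!wholeNumber || index < 0 || index >= labels.size()) {
            return "";
        }
        return labels.get(index);
    }

    public static void main(String[] args) {
        // Assumed label order, for illustration only.
        List<String> labels = Arrays.asList("health", "travel");

        // A classifier-style prediction: a label index stored in a double.
        System.out.println(1.0 + " -> " + labelFor(labels, 1.0));
        // A regression-style prediction: a continuous value with no matching label.
        double regression = -0.03816793850062397;
        System.out.println(regression + " -> " + labelFor(labels, regression));
    }
}
```

The second line printed by this sketch ends in an empty string after the arrow, much like the second line of the output in the question.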
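The rest are standard statistics about the classifier's performance, but they are pretty useless for a single classified instance. All errors are zero because your loop calls setClassValue(cls) before evaluateModelOnce(cls, ...), so the prediction is compared with itself and the one sample counts as predicted correctly.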
Correlation coefficient 0
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 1
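Why these numbers are all zero can be checked by hand. The sketch below uses the textbook definitions of mean absolute error and root mean squared error in plain Java; it is not Weka's Evaluation class, and the class and method names are made up for this example. It assumes the situation in the question's loop, where the stored class value has been overwritten with the prediction before evaluation.

```java
public class SingleInstanceErrorDemo {

    /** Mean of |predicted - actual| over all instances. */
    static double meanAbsoluteError(double[] predicted, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]);
        }
        return sum / predicted.length;
    }

    /** Square root of the mean of (predicted - actual)^2 over all instances. */
    static double rootMeanSquaredError(double[] predicted, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / predicted.length);
    }

    public static void main(String[] args) {
        double prediction = -0.03816793850062397;
        double[] predicted = {prediction};
        // The "actual" value was overwritten with the prediction, as in the question's loop.
        double[] actual = {prediction};

        System.out.println("Mean absolute error     " + meanAbsoluteError(predicted, actual));
        System.out.println("Root mean squared error " + rootMeanSquaredError(predicted, actual));
    }
}
```

Both errors come out as zero whenever the predicted and stored values coincide, so the summary says nothing about how good the model is. Evaluating against instances that keep their true labels would give meaningful figures.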
Answer 2:
Just in case someone else gets the same error with M5P: check whether the ARFF file is empty or contains only a header.
Otherwise, try
model.buildClassifier(....)
instead of
model.setClassifier(....);
That solved it for me.
Source: https://stackoverflow.com/questions/15280072/weka-classification-and-predicted-class