I am trying to do text classification with Naive Bayes using the Weka library in my Java code, but I think the classification results are incorrect and I don't know what the problem is. I use ARFF files for the input.
This is my training data:
@relation hamspam
@attribute text string
@attribute class {spam,ham}
@data
'good',ham
'good',ham
'very good',ham
'bad',spam
'very bad',spam
'very bad, very bad',spam
'good good bad',ham
This is my testing data:
@relation test
@attribute text string
@attribute class {spam,ham}
@data
'good bad very bad',?
'good bad very bad',?
'good',?
'good very good',?
'bad',?
'very good',?
'very very good',?
And this is my code:
public static void NaiveBayes(String training_file, String testing_file) throws FileNotFoundException, IOException, Exception {
    //filter
    StringToWordVector filter = new StringToWordVector();
    Classifier naive = new NaiveBayes();
    //training data
    Instances train = new Instances(new BufferedReader(new FileReader(training_file)));
    int lastIndex = train.numAttributes() - 1;
    train.setClassIndex(lastIndex);
    filter.setInputFormat(train);
    train = Filter.useFilter(train, filter);
    //testing data
    Instances test = new Instances(new BufferedReader(new FileReader(testing_file)));
    test.setClassIndex(lastIndex);
    filter.setInputFormat(test);
    Instances test2 = Filter.useFilter(test, filter);
    naive.buildClassifier(train);
    for (int i = 0; i < test2.numInstances(); i++) {
        System.out.println(test.instance(i));
        double index = naive.classifyInstance(test2.instance(i));
        String className = train.attribute(0).value((int)index);
        System.out.println(className);
    }
}
The result indicates that the data that should have been classified as spam was classified as ham, and the data that should have been classified as ham was classified as spam. What is the problem? Please help.
Your code seems fine, though I have two comments to make.
- First, you set the filter's format with this command:
  filter.setInputFormat(train);
  so that the same filter can be used to make the test and training data compatible. You should not change the format again with this command:
  filter.setInputFormat(test);
  as this might create compatibility issues.
- Also, instead of getting the first attribute:
  train.attribute(0).value((int)index);
  (which, it seems to me, does not correspond to the class attribute), try using this command:
  train.classAttribute().value((int)index);
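To illustrate the first comment, here is a toy sketch in plain Java (no Weka involved) of why a word dictionary rebuilt from the test data can end up incompatible with the trained model. The class `VocabularyDemo` and its methods are hypothetical and only mimic the idea of mapping words to columns; Weka's actual `StringToWordVector` is more elaborate.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VocabularyDemo {

    /** Assigns each distinct word a column index, in order of first appearance. */
    public static Map<String, Integer> buildVocabulary(List<String> documents) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (String document : documents) {
            for (String word : document.split("\\s+")) {
                vocabulary.putIfAbsent(word, vocabulary.size());
            }
        }
        return vocabulary;
    }

    /** Turns a document into a word-count vector using a fixed vocabulary. */
    public static int[] vectorize(String document, Map<String, Integer> vocabulary) {
        int[] counts = new int[vocabulary.size()];
        for (String word : document.split("\\s+")) {
            Integer column = vocabulary.get(word);
            if (column != null) {
                counts[column]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Simplified stand-ins for the training and test texts (assumption).
        List<String> train = Arrays.asList("good", "very good", "bad", "very bad");
        List<String> test = Arrays.asList("bad", "good bad very bad");

        Map<String, Integer> trainVocabulary = buildVocabulary(train);
        Map<String, Integer> testVocabulary = buildVocabulary(test);

        // Same document, two different column layouts.
        System.out.println(Arrays.toString(vectorize("bad", trainVocabulary))); // [0, 0, 1]
        System.out.println(Arrays.toString(vectorize("bad", testVocabulary)));  // [1, 0, 0]
    }
}
```

With the training vocabulary, "bad" is counted in the third column; with a vocabulary rebuilt from the test data, it lands in the first column. A model trained on the first layout would then read it as a different word, which is the kind of compatibility issue that keeping the filter initialized on the training data avoids.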
P.S. Check "Load naïve Bayes model in Java code using weka jar" for a complete workflow and explanation of a classification example (the material was once in SO Documentation). That example uses the LibLinear classifier, but the logic is the same.
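As a rough sketch of such a workflow for the data in the question, assuming the Weka jar is on the classpath, the two comments above could be applied as follows. The class name `HamSpamSketch` and the method `classify` are made up for illustration; this is not the code from the linked example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class HamSpamSketch {

    /** Trains on one ARFF source and returns the predicted label for each test instance. */
    public static List<String> classify(Reader trainingArff, Reader testingArff) throws Exception {
        Instances train = new Instances(trainingArff);
        train.setClassIndex(train.numAttributes() - 1);

        Instances test = new Instances(testingArff);
        test.setClassIndex(test.numAttributes() - 1);

        // Initialize the filter once, on the training data only.
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(train);
        Instances trainVectors = Filter.useFilter(train, filter);
        // Reuse the same filter for the test data, without calling setInputFormat again.
        Instances testVectors = Filter.useFilter(test, filter);

        Classifier naive = new NaiveBayes();
        naive.buildClassifier(trainVectors);

        List<String> labels = new ArrayList<>();
        for (int i = 0; i < testVectors.numInstances(); i++) {
            double index = naive.classifyInstance(testVectors.instance(i));
            // Look the label up on the class attribute rather than on attribute 0.
            labels.add(trainVectors.classAttribute().value((int) index));
        }
        return labels;
    }

    public static void main(String[] args) throws Exception {
        // args[0] and args[1] are the paths to the training and testing ARFF files.
        try (Reader train = new BufferedReader(new FileReader(args[0]));
             Reader test = new BufferedReader(new FileReader(args[1]))) {
            classify(train, test).forEach(System.out::println);
        }
    }
}
```

Taking `Reader` arguments rather than file names is only a convenience here, so the same method can be fed from files or from in-memory strings.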
Source: https://stackoverflow.com/questions/41935193/simple-text-classification-using-naive-bayes-weka-in-java