I am fairly new to Weka, and even newer to Weka on the command line. I find the documentation poor, and I am struggling to figure out a few things. For example, I want
A better way to do all that you want is to use the GUI Explorer. Here is how:
1) Take two separate files for training and testing.
Use 'Open File' under the Preprocess tab to choose your training file. Use 'Supplied Test Set' radio under the Classify tab to choose your test file.
2) Output the predictions for the missing labels.
Use 'More Options' and choose 'Output Predictions' under the Classify tab to see predictions.
3) Use more than one filter
Use 'Filter' under the Preprocess tab to apply as many filters as you want before classifying.
4) Make class the last attribute
This is actually unnecessary. You can choose any attribute to be your class. A class is any attribute that you want the classifier to predict. Use the Nom(Class) dropdown on the Classify tab to choose which attribute is your class.
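For step 2, the test instances whose labels you want predicted should carry a question mark in the class position, which is how ARFF marks missing values. A minimal sketch of such a test file (the relation, attribute names, and values here are made up for illustration):

```
@relation example-test

@attribute text  string
@attribute class {yes,no}

@data
'first unlabeled document',?
'second unlabeled document',?
```

With the class values left as ?, the classifier's output predictions are the labels you are after.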
Weka is not exactly a shining example of documentation, but you can still find valuable information about it on its sites. You should start with the Primer. I understand that you want to classify text files, so you should also have a look at Text categorization with WEKA. There is also a new Weka documentation site.
[Edit: Wikispaces has shut down and Weka hasn't brought up the sites somewhere else, yet, so I've modified the links to point at the Google cache. If someone reads this and a new Weka Wiki is up, feel free to edit the links and remove this note.]
The command line you posted in your question contains an error. I know, you copied it from my answer to another question, but I also only just noticed it. You have to omit the -- -c last, because the ReplaceMissingValues filter doesn't like it.
In the Primer it says:

"weka.filters.supervised: Classes below weka.filters.supervised in the class hierarchy are for supervised filtering, i.e. taking advantage of the class information. A class must be assigned via -c; for WEKA default behaviour use -c last."

but ReplaceMissingValues is an unsupervised filter, as is StringToWordVector.
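Since it is unsupervised, the filter can be run without any -c option at all. A hedged sketch of a standalone invocation, assuming weka.jar is in the current directory and input.arff/output.arff are placeholder paths:

```shell
# Unsupervised filter: no class assignment (-c) is needed or accepted
java -classpath weka.jar weka.filters.unsupervised.attribute.ReplaceMissingValues \
    -i input.arff \
    -o output.arff
```

The -i/-o options are the standard Weka filter options for reading one ARFF file and writing the filtered result to another.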
Adding multiple filters is also no problem: that is what the MultiFilter is for. The command line can get a bit messy, though. (I chose RandomForest here because it is a lot faster than NN.)
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
-t ~/weka-3-7-9/data/ReutersCorn-train.arff \
-T ~/weka-3-7-9/data/ReutersCorn-test.arff \
-F "weka.filters.MultiFilter \
-F weka.filters.unsupervised.attribute.StringToWordVector \
-F weka.filters.unsupervised.attribute.Standardize" \
-W weka.classifiers.trees.RandomForest -- -I 100
Here is what the Primer says about getting the prediction:
However, if more detailed information about the classifier's predictions are necessary, -p # outputs just the predictions for each test instance, along with a range of one-based attribute ids (0 for none).
It is a good convention to put general options like -p 0 directly after the class you're calling, so the command line would be
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
-t ~/weka-3-7-9/data/ReutersCorn-train.arff \
-T ~/weka-3-7-9/data/ReutersCorn-test.arff \
-p 0 \
-F "weka.filters.MultiFilter \
-F weka.filters.unsupervised.attribute.StringToWordVector \
-F weka.filters.unsupervised.attribute.Standardize" \
-W weka.classifiers.trees.RandomForest -- -I 100
But as you can see, WEKA can get very complicated when calling it from the command line. This is due to the tree structure of WEKA classifiers and filters. Though you can run only one classifier/filter per command line, it can be structured as complex as you like. For the above command, the structure looks like this:
The FilteredClassifier will initialize a filter on the training data set, filter both training and test data, then train a model on the training data and classify the given test data.
FilteredClassifier
|
+ Filter
|
+ Classifier
If we want multiple filters, we use the MultiFilter, which is only one filter, but it calls multiple others in the order they were given.
FilteredClassifier
|
+ MultiFilter
| |
| + StringToWordVector
| |
| + Standardize
|
+ RandomForest
The hard part of running something like this from the command line is assigning the desired options to the right classes, because often the option names are the same. For example, the -F option is used by both the FilteredClassifier and the MultiFilter, so I had to use quotes to make it clear which -F belongs to which filter.
In the last line, you see that the option -I 100, which belongs to the RandomForest, can't be appended directly, because then it would be assigned to the FilteredClassifier and you would get Illegal options: -I 100. Hence, you have to add -- before it.
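To make the -- convention concrete, here is a minimal contrast (train.arff stands in for the training file path used above):

```shell
# Wrong: -I 100 is consumed by FilteredClassifier -> "Illegal options: -I 100"
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
    -t train.arff -W weka.classifiers.trees.RandomForest -I 100

# Right: everything after -- is handed through to RandomForest
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
    -t train.arff -W weka.classifiers.trees.RandomForest -- -I 100
```

The same pattern applies to any scheme named with -W: its own options always go after the -- separator.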
Adding the predicted class label is also possible, but even more complicated. AFAIK this can't be done in one step, but you have to train and save a model first, then use this one for predicting and assigning new class labels.
Training and saving the model:
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
-t ~/weka-3-7-9/data/ReutersCorn-train.arff \
-d rf.model \
-F "weka.filters.MultiFilter \
-F weka.filters.unsupervised.attribute.StringToWordVector \
-F weka.filters.unsupervised.attribute.Standardize" \
-W weka.classifiers.trees.RandomForest -- -I 100
This will serialize the model of the trained FilteredClassifier to the file rf.model. The important thing here is that the initialized filter is serialized as well; otherwise the test set wouldn't be compatible after filtering.
Loading the model, making predictions and saving it:
java -classpath weka.jar weka.filters.supervised.attribute.AddClassification \
-serialized rf.model \
-classification \
-remove-old-class \
-i ~/weka-3-7-9/data/ReutersCorn-test.arff \
-o pred.arff \
-c last