Learning Weka on the Command Line

Asked 2020-12-24 00:25

I am fairly new to Weka, and even newer to Weka on the command line. I find the documentation poor, and I am struggling to figure out a few things to do. For example, want

2 Answers
  • 2020-12-24 01:12

    A better way to do all that you want is to use the GUI Explorer. Here is how:

    1) Take two separate files for training and testing.

    Use 'Open File' under the Preprocess tab to choose your training file. Use the 'Supplied test set' radio button under the Classify tab to choose your test file.

    2) Output the predictions for the missing labels.

    Use 'More Options' and choose 'Output Predictions' under the Classify tab to see predictions.

    3) Use more than one filter

    Use 'Filter' under the Preprocess tab to apply as many filters as you want before classifying.

    4) Make class the last attribute

    This is actually unnecessary. You can choose any attribute to be your class. A class is any attribute that you want the classifier to predict. Use the Nom(Class) dropdown on the Classify tab to choose which attribute is your class.

  • 2020-12-24 01:17

    Weka is not exactly a shining example of documentation, but you can still find valuable information about it on their sites. You should start with the Primer. I understand that you want to classify text files, so you should also have a look at Text categorization with WEKA. There is also a new Weka documentation site.

    [Edit: Wikispaces has shut down and Weka hasn't brought up the sites somewhere else, yet, so I've modified the links to point at the Google cache. If someone reads this and a new Weka Wiki is up, feel free to edit the links and remove this note.]

    The command line you posted in your question contains an error. I know, you copied it from my answer to another question, but I also just noticed it. You have to omit the -- -c last, because the ReplaceMissingValues filter doesn't accept it.

    In the Primer it says:

    weka.filters.supervised

    Classes below weka.filters.supervised in the class hierarchy are for supervised filtering, i.e. taking advantage of the class information. A class must be assigned via -c, for WEKA default behaviour use -c last.

    but ReplaceMissingValues is an unsupervised filter, as is StringToWordVector.
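    The corrected invocation would therefore look like the following sketch. The file names and the J48 classifier here are placeholders, and the command is only assembled into a variable rather than run, since executing it requires weka.jar and the data files to be present:

    ```shell
    # Assemble the corrected command: ReplaceMissingValues is unsupervised,
    # so there must be no "-- -c last" after the filter specification.
    CMD='java -classpath weka.jar weka.classifiers.meta.FilteredClassifier
      -t train.arff
      -T test.arff
      -F weka.filters.unsupervised.attribute.ReplaceMissingValues
      -W weka.classifiers.trees.J48'
    echo "$CMD"
    ```

    Only a supervised filter would need the class assigned via -c.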

    Multiple filters

    Adding multiple filters is also no problem; that is what the MultiFilter is for. The command line can get a bit messy, though. (I chose RandomForest here, because it is a lot faster than NN.)

    java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
      -t ~/weka-3-7-9/data/ReutersCorn-train.arff \
      -T ~/weka-3-7-9/data/ReutersCorn-test.arff \
      -F "weka.filters.MultiFilter \
          -F weka.filters.unsupervised.attribute.StringToWordVector \
          -F weka.filters.unsupervised.attribute.Standardize" \
      -W weka.classifiers.trees.RandomForest -- -I 100

    Making predictions

    Here is what the Primer says about getting the prediction:

    However, if more detailed information about the classifier's predictions are necessary, -p # outputs just the predictions for each test instance, along with a range of one-based attribute ids (0 for none).

    It is a good convention to put those general options like -p 0 directly after the class you're calling, so the command line would be

    java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
      -t ~/weka-3-7-9/data/ReutersCorn-train.arff \
      -T ~/weka-3-7-9/data/ReutersCorn-test.arff \
      -p 0 \
      -F "weka.filters.MultiFilter \
          -F weka.filters.unsupervised.attribute.StringToWordVector \
          -F weka.filters.unsupervised.attribute.Standardize" \
      -W weka.classifiers.trees.RandomForest -- -I 100

    Structure of WEKA classifiers/filters

    But as you can see, WEKA can get very complicated when called from the command line. This is due to the tree structure of WEKA classifiers and filters. Though you can run only one classifier/filter per command line, its structure can be as complex as you like. For the above command, the structure looks like this:

    The FilteredClassifier will initialize a filter on the training data set, filter both training and test data, then train a model on the training data and classify the given test data.

    FilteredClassifier
     |
     + Filter
     |
     + Classifier
    

    If we want multiple filters, we use the MultiFilter, which is only one filter, but it calls multiple others in the order they were given.

    FilteredClassifier
     |
     + MultiFilter
     |  |
     |  + StringToWordVector
     |  |
     |  + Standardize
     |
     + RandomForest
    

    The hard part of running something like this from the command line is assigning the desired options to the right classes, because often the option names are the same. For example, the -F option is used for the FilteredClassifier and the MultiFilter as well, so I had to use quotes to make it clear which -F belongs to what filter.
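    The effect of the quotes can be checked with the shell itself, no Weka needed: with quotes, the whole MultiFilter spec reaches FilteredClassifier as one argument value; without them, the shell splits it into separate words and the inner -F flags are swallowed by the outer parser. A minimal sketch (the filter spec is shortened for readability):

    ```shell
    # With quotes, the entire MultiFilter spec arrives as ONE argument value
    # for the outer -F; the inner -F is parsed later by Weka itself.
    set -- -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector"
    quoted_argc=$#

    # Without quotes, the shell splits the spec into four separate arguments.
    set -- -F weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector
    unquoted_argc=$#

    echo "quoted: $quoted_argc arguments, unquoted: $unquoted_argc arguments"
    # → quoted: 2 arguments, unquoted: 4 arguments
    ```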

    In the last line, you see that the option -I 100, which belongs to the RandomForest, can't be appended directly, because then it would be assigned to the FilteredClassifier and you would get Illegal options: -I 100. Hence, you have to add -- before it.

    Adding predictions to the data files

    Adding the predicted class label is also possible, but even more complicated. AFAIK this can't be done in one step, but you have to train and save a model first, then use this one for predicting and assigning new class labels.

    Training and saving the model:

    java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
      -t ~/weka-3-7-9/data/ReutersCorn-train.arff \
      -d rf.model \
      -F "weka.filters.MultiFilter \
          -F weka.filters.unsupervised.attribute.StringToWordVector \
          -F weka.filters.unsupervised.attribute.Standardize" \
      -W weka.classifiers.trees.RandomForest -- -I 100

    This will serialize the model of the trained FilteredClassifier to the file rf.model. The important thing here is that the initialized filter will also be serialized, otherwise the test set wouldn't be compatible after filtering.

    Loading the model, making predictions and saving it:

    java -classpath weka.jar weka.filters.supervised.attribute.AddClassification \
      -serialized rf.model \
      -classification \
      -remove-old-class \
      -i ~/weka-3-7-9/data/ReutersCorn-test.arff \
      -o pred.arff \
      -c last
    