Question
I'm running some tests in Weka; I hope someone can help me and that I can make myself clear.
Step 1: Tokenize my data
@attribute text string
@attribute @@class@@ {derrota,empate,vitoria}
@data
'O Grêmio perdeu para o Cruzeiro por 1 a 0',derrota
'O Grêmio venceu o Palmeiras em um grande jogo de futebol, nesta quarta-feira na Arena',vitoria
Step 2: Build model on tokenized data
After loading this, I apply a StringToWordVector filter. After applying the filter, I save a new ARFF file with the tokenized words. Something like:
@attribute @@class@@ {derrota,empate,vitoria}
@attribute o numeric
@attribute grêmio numeric
@attribute perdeu numeric
@attribute venceu numeric
... and so on
@data
{0 derrota, 1 1, 2 1, 3 1, ...}
{0 vitoria, 1 1, 2 1, 4 1, ...}
OK! Now, based on this ARFF, I build my classifier model and save it.
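As an aside, the sparse rows above are index/value pairs, with index 0 holding the class; zero counts can simply be omitted. A minimal sketch in plain Java (no Weka; the vocabulary order o, grêmio, perdeu, venceu is only a hypothetical one for illustration) of emitting such a row:

```java
public class SparseRow {
    // Build a sparse-ARFF-style row like "{0 derrota, 1 1, 2 1, 3 1}".
    // Index 0 is the class value; word attributes start at index 1.
    static String toSparseRow(String label, int[] counts) {
        StringBuilder sb = new StringBuilder("{0 ").append(label);
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] != 0) {               // zero counts are simply omitted
                sb.append(", ").append(i + 1).append(' ').append(counts[i]);
            }
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // hypothetical vocabulary order: o, grêmio, perdeu, venceu
        System.out.println(toSparseRow("derrota", new int[]{1, 1, 1, 0}));
        // → {0 derrota, 1 1, 2 1, 3 1}
    }
}
```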
Step 3: Test with "simulated new data"
If I want to test my model with "simulated new data", what I currently do is edit this last ARFF file and add a line like:
{0 ?, 1 1, 2 1, 3 1, ...}
Step 4 (my problem): How to test with really new data
So far so good. My problem is when I need to use this model with really new data. For example, suppose I have the string "O Grêmio caiu diante do Palmeiras". It contains 4 new words that don't exist in my model and 2 that do.
How can I create an ARFF file with this new data that fits my model? (I know the 4 new words will not be present, but how can I work with this?)
After supplying a different test set, the following message appears:
Answer 1:
If you use Weka programmatically, you can do this fairly easily.
- Create your training file (e.g. training.arff)
- Create Instances from the training file:
Instances trainingData = new Instances(new BufferedReader(new FileReader("training.arff")));
trainingData.setClassIndex(trainingData.numAttributes() - 1);
- Use StringToWordVector to transform your string attribute into a numeric representation:
sample code:
StringToWordVector filter = new StringToWordVector();
filter.setWordsToKeep(1000000);
if (useIdf) {
    filter.setIDFTransform(true);
}
filter.setTFTransform(true);
filter.setLowerCaseTokens(true);
filter.setOutputWordCounts(true);
filter.setMinTermFreq(minTermFreq);
filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL, StringToWordVector.TAGS_FILTER));
NGramTokenizer t = new NGramTokenizer();
t.setNGramMaxSize(maxGrams);
t.setNGramMinSize(minGrams);
filter.setTokenizer(t);
WordsFromFile stopwords = new WordsFromFile();
stopwords.setStopwords(new File("data/stopwords/stopwords.txt"));
filter.setStopwordsHandler(stopwords);
if (useStemmer) {
    Stemmer s = new /*Iterated*/LovinsStemmer();
    filter.setStemmer(s);
}
filter.setInputFormat(trainingData);
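For intuition, the TF and IDF options set above roughly log-scale the raw word count and down-weight words that occur in many documents. A plain-Java sketch of that weighting (a simplification for illustration; Weka's exact formulas may differ slightly):

```java
public class TfIdfSketch {
    // Sketch of the two transforms StringToWordVector can apply:
    //   TF transform:  count -> log(1 + count)
    //   IDF transform: weight scaled by log(numDocs / docFreq)
    static double tfIdf(int count, int numDocs, int docFreq) {
        double tf = Math.log(1 + count);
        return tf * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // a word appearing once, in 1 of 2 training documents
        System.out.println(tfIdf(1, 2, 1));
        // a word appearing in every document gets weight 0
        System.out.println(tfIdf(1, 2, 2));
    }
}
```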
- Apply the filter to trainingData:
trainingData = Filter.useFilter(trainingData, filter);
- Select a classifier to create your model.
sample code for the LibLINEAR classifier:
Classifier cls = null;
LibLINEAR liblinear = new LibLINEAR();
liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE));
liblinear.setProbabilityEstimates(true);
// liblinear.setBias(1); // default value
cls = liblinear;
cls.buildClassifier(trainingData);
- Save model
sample code
System.out.println("Saving the model...");
ObjectOutputStream oos;
oos = new ObjectOutputStream(new FileOutputStream(path+"mymodel.model"));
oos.writeObject(cls);
oos.flush();
oos.close();
- Create a testing file (e.g. testing.arff)
- Create Instances from the testing file:
Instances testingData = new Instances(new BufferedReader(new FileReader("testing.arff")));
testingData.setClassIndex(testingData.numAttributes() - 1);
- Load the classifier:
sample code
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path+"mymodel.model");
- Use the same StringToWordVector filter as above, or create a new one for testingData, but remember to call the following with the trainingData:
filter.setInputFormat(trainingData);
This keeps the format of the training set and will not add words that are absent from it. Apply the filter to testingData:
testingData = Filter.useFilter(testingData, filter);
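Because the filter's input format was taken from trainingData, words that never occurred in training are simply dropped from the test vectors. A plain-Java sketch of that effect (no Weka), using the question's example sentence and a hypothetical training vocabulary:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VocabularyMapping {
    // Keep only tokens present in the training vocabulary;
    // unseen words ("caiu", "diante", ...) are ignored.
    static List<String> mapToVocabulary(List<String> vocabulary, String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (vocabulary.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("o", "grêmio", "perdeu", "venceu");
        System.out.println(mapToVocabulary(vocab, "O Grêmio caiu diante do Palmeiras"));
        // → [o, grêmio]
    }
}
```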
- Classify!
sample code
for (int j = 0; j < testingData.numInstances(); j++) {
    double res = myCls.classifyInstance(testingData.get(j));
}
- Not sure if this can be done through GUI.
- Save and load steps are optional.
Edit: after some digging in the Weka GUI, I think it is possible to do this there as well. In the Classify tab, set your testing set in the "Supply test set" field. At that point your train and test sets will normally be reported as incompatible; to fix this, click Yes in the dialog that appears, and you are good to go.
Source: https://stackoverflow.com/questions/40223798/how-to-use-created-model-with-new-data-in-weka