Creating training data for a Maxent classifier in Java

Submitted by 柔情痞子 on 2019-12-04 07:46:27
Viliam Simko

If I understand it correctly, you are trying to treat sentences as a set of POS tags.

In your example, the sentence "My name is XYZ" would be represented as the set (PRP$, NN, VBZ, NNP). That means every sentence is effectively a binary vector of length 37 (there are 36 possible POS tags according to this page, plus the CLASS outcome feature for the whole sentence).

This can be encoded for OpenNLP Maxent as follows:

PRP$=1 NN=1 VBZ=1 NNP=1 CLASS=SomeClassOfYours1

or simply:

PRP$ NN VBZ NNP CLASS=SomeClassOfYours1
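The conversion from a tag set to such a line is straightforward to script. Here is a minimal sketch of a helper that produces the simplified format above; the class and method names are mine, not part of the OpenNLP API:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class PosTagEncoder {

    // Builds one training line in the sparse format shown above,
    // e.g. "PRP$ NN VBZ NNP CLASS=SomeClassOfYours1".
    // Tags present in the set become active features; everything else
    // is implicitly absent, so only the "1" entries need to be written.
    static String encode(Set<String> posTags, String outcome) {
        StringBuilder sb = new StringBuilder();
        for (String tag : posTags) {
            sb.append(tag).append(' ');
        }
        sb.append("CLASS=").append(outcome);
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> tags = new LinkedHashSet<>();
        tags.add("PRP$");
        tags.add("NN");
        tags.add("VBZ");
        tags.add("NNP");
        System.out.println(encode(tags, "SomeClassOfYours1"));
        // prints: PRP$ NN VBZ NNP CLASS=SomeClassOfYours1
    }
}
```

A `LinkedHashSet` keeps the tags in insertion order, which makes the output deterministic; Maxent itself does not care about feature order within a line.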

(For a working code snippet, see my answer here: Training models using openNLP maxent)

Some more sample data would be:

  1. "By 1978, Radio City had lost its glamour, and the owners of Rockefeller Center decided to demolish the aging hall."
  2. "In time he was entirely forgotten, many of his buildings were demolished, others insensitively altered."
  3. "As soon as she moved out, the mobile home was demolished, the suit said."
  4. ...

This would yield samples:

IN CD NNP VBD VBN PRP$ NN CC DT NNS IN TO VB VBG CLASS=SomeClassOfYours2
IN NN PRP VBD RB VBN JJ IN PRP$ NNS CLASS=SomeClassOfYours3
IN RB PRP VBD RP DT JJ NN VBN NN CLASS=SomeClassOfYours2
...

However, I don't expect such a classification to yield good results. It would be better to use other structural features of the sentence, such as the parse tree or the dependency tree, which can be obtained using e.g. the Stanford parser.

Edited on 28.3.2016: You can also use the whole sentence as a training sample. However, be aware that:

  - two sentences might contain the same words but have different meanings
  - there is a pretty high chance of overfitting
  - you should use short sentences
  - you need a huge training set

According to your example, I would encode the training samples as follows:

class=CLASS My_PRP name_NN is_VBZ XYZ_NNP
...

Notice that the outcome variable comes as the first element on each line.
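Generating that word_TAG format from a tokenized, tagged sentence can be sketched as follows; again, the class name is mine and the method is just a formatting helper, not an OpenNLP API:

```java
public class WordTagEncoder {

    // Builds one training line in the word_TAG format, with the outcome
    // as the first element, e.g. "class=CLASS My_PRP name_NN is_VBZ XYZ_NNP".
    // Assumes words[i] is tagged with tags[i] (parallel arrays, equal length).
    static String encode(String outcome, String[] words, String[] tags) {
        StringBuilder sb = new StringBuilder("class=").append(outcome);
        for (int i = 0; i < words.length; i++) {
            sb.append(' ').append(words[i]).append('_').append(tags[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] words = {"My", "name", "is", "XYZ"};
        String[] tags  = {"PRP$", "NN", "VBZ", "NNP"};
        System.out.println(encode("CLASS", words, tags));
        // prints: class=CLASS My_PRP name_NN is_VBZ XYZ_NNP
    }
}
```

Combining each word with its tag keeps "bank" as a noun and "bank" as a verb distinct features, which is exactly why this encoding differs from using the raw words alone.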

Here is a fully working minimal example using opennlp-maxent-3.0.3.jar.


package my.maxent;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import opennlp.maxent.GIS;
import opennlp.maxent.io.GISModelReader;
import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
import opennlp.model.AbstractModel;
import opennlp.model.AbstractModelWriter;
import opennlp.model.DataIndexer;
import opennlp.model.DataReader;
import opennlp.model.FileEventStream;
import opennlp.model.MaxentModel;
import opennlp.model.OnePassDataIndexer;
import opennlp.model.PlainTextFileDataReader;

public class MaxentTest {


    public static void main(String[] args) throws IOException {

        String trainingFileName = "training-file.txt";
        String modelFileName = "trained-model.maxent.gz";

        // Training a model from data stored in a file.
        // The training file contains one training sample per line.
        DataIndexer indexer = new OnePassDataIndexer( new FileEventStream(trainingFileName)); 
        MaxentModel trainedMaxentModel = GIS.trainModel(100, indexer); // 100 iterations

        // Storing the trained model into a file for later use (gzipped)
        File outFile = new File(modelFileName);
        AbstractModelWriter writer = new SuffixSensitiveGISModelWriter((AbstractModel) trainedMaxentModel, outFile);
        writer.persist();

        // Loading the gzipped model from a file
        FileInputStream inputStream = new FileInputStream(modelFileName);
        InputStream decodedInputStream = new GZIPInputStream(inputStream);
        DataReader modelReader = new PlainTextFileDataReader(decodedInputStream);
        MaxentModel loadedMaxentModel = new GISModelReader(modelReader).getModel();

        // Now predicting the outcome using the loaded model
        String[] context = {"is_VBZ", "Gaby_NNP"};
        double[] outcomeProbs = loadedMaxentModel.eval(context);

        String outcome = loadedMaxentModel.getBestOutcome(outcomeProbs);
        System.out.println("=======================================");
        System.out.println(outcome);
        System.out.println("=======================================");
    }

}

And some dummy training data (stored as training-file.txt):

class=Male      My_PRP name_NN is_VBZ John_NNP
class=Male      My_PRP name_NN is_VBZ Peter_NNP
class=Female    My_PRP name_NN is_VBZ Anna_NNP
class=Female    My_PRP name_NN is_VBZ Gaby_NNP

This yields the following output:

Indexing events using cutoff of 0
Computing event counts...  done. 4 events
Indexing...  done.
Sorting and merging events... done. Reduced 4 events to 4.
Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 4
        Number of Outcomes: 2
      Number of Predicates: 7
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-2.772588722239781  0.5
  2:  ... loglikelihood=-2.4410105407571203 1.0
      ...
 99:  ... loglikelihood=-0.16111520541752372    1.0
100:  ... loglikelihood=-0.15953272940719138    1.0
=======================================
class=Female
=======================================