Open NLP Name Finder Training

Posted by 旧街凉风 on 2019-12-06 04:09:54

Question


I am building a 15k-line training data document called en-ner-person.train, following the online manual (http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html).

My question is: in my training document, do I include an entire report? Or do I only include the lines which have a name: <START:person> John Smith <END>?

So, for example, do I use this entire report in my training data:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
A nonexecutive  director has many similar responsibilities as an executive director.
However, there are no voting rights with this position.
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

Or do I only include these two lines in my training document:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

Answer 1:


You should use the entire report. Sentences without names teach the system when not to mark an entity, which should reduce false positives and so improve precision.

You can measure this with the evaluation tool. Reserve some sentences of your corpus for testing, for example 1/10 of the total, and train your model on the other 9/10. Train one model on the entire report and another on only the sentences with names, then compare the two. The results are reported in terms of precision and recall.

Remember that the test sample must come from the entire report, not only from the sentences with names. Otherwise you will not get an accurate measure of how the model performs on sentences without names.
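The experiment described above can be sketched in plain Java. This is only an illustration of the data preparation and of the two metrics, not OpenNLP code: the class name NerSplitSketch and its methods are hypothetical, the every-tenth-line split is just one simple way to reserve 1/10 of the corpus, and the counts of true positives, false positives and false negatives would in practice come from whatever evaluation you run on the held-out lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
public class NerSplitSketch {

    /** Every n-th line (counting from 1) goes to the test set, the rest to training. */
    public static List<List<String>> split(List<String> lines, int n) {
        List<String> train = new ArrayList<>();
        List<String> test = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if ((i + 1) % n == 0) {
                test.add(lines.get(i));
            } else {
                train.add(lines.get(i));
            }
        }
        return Arrays.asList(train, test);
    }

    /** Names-only variant of a training set, for the comparison run. */
    public static List<String> namesOnly(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (line.contains("<START:")) {
                kept.add(line);
            }
        }
        return kept;
    }

    /** Share of predicted names that were correct. */
    public static double precision(int truePos, int falsePos) {
        int predicted = truePos + falsePos;
        return predicted == 0 ? 0.0 : (double) truePos / predicted;
    }

    /** Share of real names that were found. */
    public static double recall(int truePos, int falseNeg) {
        int actual = truePos + falseNeg;
        return actual == 0 ? 0.0 : (double) truePos / actual;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
            "<START:person> Pierre Vinken <END> , 61 years old , will join the board .",
            "However, there are no voting rights with this position.",
            "Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. .");
        List<List<String>> parts = split(corpus, 3);
        List<String> fullTrain = parts.get(0);          // train model A on this
        List<String> namesTrain = namesOnly(fullTrain); // train model B on this
        List<String> test = parts.get(1);               // evaluate both on this, unfiltered
        System.out.println(fullTrain.size() + " / " + namesTrain.size() + " / " + test.size());
    }
}
```

Note that the names-only filter is applied to the training part only; the test part is left unfiltered, as the last paragraph of this answer requires.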




Answer 2:


I would include everything, even though not all of it may contribute to the weights in the trained model.

What is or isn't used from the training file is determined by the feature generator used to train the model. If you get to the point where you are actually tweaking the feature generator, then at least you won't need to rebuild your training file if it already includes everything.

This example feature generator, from the "Custom Feature Generation" section of the documentation, also happens to be the default one used in the code for name finders:

import opennlp.tools.util.featuregen.*;

// Default name finder feature set, wrapped in a cache.
AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
    new AdaptiveFeatureGenerator[] {
        // token and token-class features, 2 positions to either side
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator(),
        new SentenceFeatureGenerator(true, false)
    });

I can't fully explain that block of code, and I haven't found good documentation on it or waded through the source to understand it. But the WindowFeatureGenerators there take into account the tokens and the classes of the tokens (as far as I can tell, the shape of a token, e.g. whether it is capitalized or all digits) within a window of 2 positions before and after the token being examined.
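To make the windowing idea concrete, here is a minimal plain-Java sketch of what a window of 2 positions on either side means. It is not OpenNLP's implementation: the class WindowSketch, the methods windowFeatures and shape, and the feature strings are all made up for illustration, and the shape categories are a simplified stand-in for token classes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; names and feature strings are hypothetical, not OpenNLP's.
public class WindowSketch {

    /** Features for tokens[index], looking `prev` tokens back and `next` tokens ahead. */
    public static List<String> windowFeatures(String[] tokens, int index, int prev, int next) {
        List<String> features = new ArrayList<>();
        for (int offset = -prev; offset <= next; offset++) {
            int pos = index + offset;
            if (pos < 0 || pos >= tokens.length) {
                continue; // the window is cut off at the ends of the token array
            }
            features.add("w" + offset + "=" + tokens[pos].toLowerCase());
            features.add("shape" + offset + "=" + shape(tokens[pos]));
        }
        return features;
    }

    /** Very rough token class: digits, allcaps, initcap, lower, or other. */
    public static String shape(String token) {
        if (token.isEmpty()) {
            return "other";
        }
        if (token.chars().allMatch(Character::isDigit)) {
            return "digits";
        }
        if (token.length() > 1 && token.chars().allMatch(Character::isUpperCase)) {
            return "allcaps";
        }
        if (Character.isUpperCase(token.charAt(0))) {
            return "initcap";
        }
        if (token.chars().allMatch(Character::isLowerCase)) {
            return "lower";
        }
        return "other";
    }

    public static void main(String[] args) {
        String[] tokens = "Mr . Vinken is chairman".split(" ");
        // Features that describe the context of "Vinken" (index 2).
        for (String feature : windowFeatures(tokens, 2, 2, 2)) {
            System.out.println(feature);
        }
    }
}
```

In this sketch the features for "Vinken" include the neighbouring "Mr", ".", "is" and "chairman", which is why the tokens surrounding a name matter for training even though they carry no annotation themselves.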

As such, it is possible that tokens in a sentence that doesn't contain an entity have an impact on a sentence that does. By cropping out the extra sentences, you may be training your model on unnatural patterns, such as a sentence ending with a name followed by a sentence that begins with a name, like this:

The car fell on <START:person> Pierre Vinken <END> . <START:person> Pierre Vinken <END> is the chairman .


Source: https://stackoverflow.com/questions/11335013/open-nlp-name-finder-training
