OpenNLP: foreign names do not get recognized

霸气de小男生 submitted on 2019-12-18 08:27:31

Question


I just started using OpenNLP to recognize names. I am using the model (en-ner-person.bin) that comes with OpenNLP. I noticed that while it recognizes US, UK, and European names, it fails to recognize Indian or Japanese names. My questions are: (1) Are there already models available that I can use to recognize foreign names? (2) If not, then I believe I will need to generate new models. In that case, is there a corpus available that I can use?


Answer 1:


You can build your own model from your own data using an OpenNLP addon called modelbuilder-addon. It is brand new, so if you try it you may be the first person other than me to do so, but it works for me.

You feed it the following:

  • a list of "known entities" via a file where each line is a name
  • a list of sentences from YOUR data via a file where each line is a sentence
  • (optionally) a blacklist to remove false positives
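If your data is raw text rather than one sentence per line, the sentence file can be prepared with a small helper. The following is a minimal sketch, not part of the addon: it uses the JDK's BreakIterator to split text, and the class name SentenceFilePrep, the sample text, and the output path sentences.txt are all hypothetical. The answer does not say whether the addon expects pre-tokenized text, so treat this only as a starting point.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceFilePrep {

  // Splits raw text into sentences using the JDK's locale-aware BreakIterator.
  static List<String> splitSentences(String text) {
    List<String> sentences = new ArrayList<>();
    BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
    it.setText(text);
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      // Collapse internal whitespace so each sentence stays on a single line.
      String sentence = text.substring(start, end).replaceAll("\\s+", " ").trim();
      if (!sentence.isEmpty()) {
        sentences.add(sentence);
      }
    }
    return sentences;
  }

  public static void main(String[] args) throws IOException {
    // Sample text is illustrative only; replace it with your own data.
    String raw = "Priya Sharma joined the Mumbai office. Haruto Tanaka flew in from Osaka.";
    // Hypothetical output path; pass a different one as the first argument.
    Path out = Paths.get(args.length > 0 ? args[0] : "sentences.txt");
    // Files.write puts each list element on its own line.
    Files.write(out, splitSentences(raw), StandardCharsets.UTF_8);
  }
}
```

The names file can be written the same way, with one known name per line.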

You can check out the addon here:

https://svn.apache.org/repos/asf/opennlp/addons/modelbuilder-addon

You can use this to get started:

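import java.io.File;
import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;

public class ModelBuilderAddonUse {

  public static void main(String[] args) {
    // Input: one sentence from your data per line.
    File fileOfSentences = new File("path to your sentence file");
    // Input: one known person name per line.
    File fileOfNames = new File("path to your file of person names");
    // Input: false positives to exclude, filled in after reviewing each run.
    File blackListFile = new File("path to your blacklist file");
    // Output: the trained model.
    File modelOutFile = new File("path to where the model will be saved");
    // Output: the annotated sentences generated during training.
    File annotatedSentencesOutFile = new File("path to where the annotated sentences will be saved");

    // "person" is the entity type; 3 is the number of iterations.
    DefaultModelBuilderUtil.generateModel(fileOfSentences, fileOfNames, blackListFile, modelOutFile, annotatedSentencesOutFile, "person", 3);
  }
}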

The idea is that your known entities (common names in your data) are used to create annotations, and those annotations are used to train a model. The model is then used to find more names and produce more annotations, and so on; the tool repeats this cycle as many times as the "iterations" parameter specifies. Run it, check your results, add any undesirable hits to the blacklist file, and then run the training again. I've used this and got pretty good results. If you find problems with it, file a ticket with OpenNLP.
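To make the "check your results" step easier, the names in the annotated sentences file can be tallied so that odd-looking hits stand out as blacklist candidates. The following is a minimal sketch under an assumption: that the addon writes annotations in OpenNLP's usual name finder training format, <START:person> ... <END>. The class name AnnotationReview and the sample lines are hypothetical and not part of the addon.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnnotationReview {

  // Assumed annotation format: <START:person> Some Name <END>
  private static final Pattern SPAN =
      Pattern.compile("<START:person>\\s*(.+?)\\s*<END>");

  // Counts how often each annotated name occurs across all lines.
  static Map<String, Integer> countNames(List<String> annotatedLines) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String line : annotatedLines) {
      Matcher m = SPAN.matcher(line);
      while (m.find()) {
        counts.merge(m.group(1), 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) throws IOException {
    // With an argument, read the annotated sentences file at that path;
    // otherwise fall back to two illustrative sample lines.
    List<String> lines = args.length > 0
        ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)
        : Arrays.asList(
            "<START:person> Priya Sharma <END> met <START:person> Haruto Tanaka <END> .",
            "<START:person> Priya Sharma <END> left early .");
    // Print count and name, one per line, for manual review.
    countNames(lines).forEach((name, n) -> System.out.println(n + "\t" + name));
  }
}
```

Any entry in the printed list that is not a person name can be added to the blacklist file before the next training run.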



Source: https://stackoverflow.com/questions/20509678/opennlp-foreign-names-does-not-get-recognized
