Open NLP NER is not properly trained

為{幸葍}努か submitted on 2019-12-11 05:25:27

Question


I tried to train a custom NER model using OpenNLP. When I pass a sentence to the model to predict entities, it picks only the first word of the sentence. I don't know where I am going wrong.

The training code is below:

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class OpenNLPNER {
    public static void main(String[] args) {
        System.out.println(train("en", "technology", "D:\\dl4j-examples-master\\dl4j-examples-master\\dl4j-examples\\src\\main\\java\\opennlpExamples\\src\\main\\resources\\technology.train", "D:\\dl4j-examples-master\\dl4j-examples-master\\dl4j-examples\\src\\main\\java\\opennlpExamples\\src\\main\\techno1.bin"));
    }

    public static String train(String lang, String entity, InputStreamFactory inputStream, FileOutputStream modelStream) {
        TokenNameFinderModel model;
        ObjectStream<NameSample> sampleStream = null;
        try {
            // Each line of the training file is parsed as one sample sentence
            ObjectStream<String> lineStream = new PlainTextByLineStream(inputStream, StandardCharsets.UTF_8);
            sampleStream = new NameSampleDataStream(lineStream);
            TokenNameFinderFactory nameFinderFactory = new TokenNameFinderFactory();
            // Use the lang and entity arguments instead of hard-coded values
            model = NameFinderME.train(lang, entity, sampleStream, TrainingParameters.defaultParams(),
                nameFinderFactory);
        } catch (IOException io) {
            // Report the failure instead of swallowing it
            io.printStackTrace();
            return "Something goes wrong with training module.";
        } finally {
            if (sampleStream != null) {
                try {
                    sampleStream.close();
                } catch (IOException io) {
                    io.printStackTrace();
                }
            }
        }
        // Write the trained model; try-with-resources closes the stream
        try (BufferedOutputStream modelOut = new BufferedOutputStream(modelStream)) {
            model.serialize(modelOut);
        } catch (IOException io) {
            io.printStackTrace();
            return "Something goes wrong with training module.";
        }
        return "Model trained and saved.";
    }

    public static String train(String lang, String entity, final String taggedCorpusFile,
                               String modelFile) {
        try {
            InputStreamFactory inputStream = new InputStreamFactory() {
                public InputStream createInputStream() throws IOException {
                    // Open a fresh stream on every call so the data can be re-read
                    return new FileInputStream(taggedCorpusFile);
                }
            };
            return train(lang, entity, inputStream,
                new FileOutputStream(modelFile));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "Something goes wrong with training module.";
    }
}

After loading the saved model and passing it a sentence, it picks only the first word, and only if the first letter of that word is capitalized.

The code that loads the model and predicts is below:

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class nameEntity {
    public static void main(String[] args) throws Exception {
        // try-with-resources closes both model streams when done
        try (InputStream modelIn = new FileInputStream("D:/main/techno.bin");
             InputStream tokenModelIn = new FileInputStream("C:/openNLP/en-token.bin")) {
            // Load the custom name finder model and instantiate the NameFinderME class
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME nameFinder = new NameFinderME(model);

            // Load the tokenizer model and instantiate the TokenizerME class
            TokenizerModel tokenModel = new TokenizerModel(tokenModelIn);
            TokenizerME tokenizer = new TokenizerME(tokenModel);

            // Get the sentence as a String array of tokens
            String sentence = "Camel is a Java software";
            String[] tokens = tokenizer.tokenize(sentence);

            // Find the names in the sentence
            nameFinder.clearAdaptiveData();
            Span[] nameSpans = nameFinder.find(tokens);
            System.out.println(sentence);

            // Print each span followed by the first token it covers
            for (Span s : nameSpans) {
                System.out.println(s.toString() + "  " + tokens[s.getStart()]);
            }
        }
    }
}

Training file:

Abdera implementation of the Atom Syndication Format and Atom Publishing Protocol, Accumulo secure implementation of BigTable, ActiveMQ message broker supporting different communication protocols and clients, including a full Java Message Service (JMS) 1.1 client. Allura Python-based an open source implementation of a software forge. Ant Java-based build tool, Apache Arrow "A high-performance cross-system data layer for columnar in-memory analytics". APR Apache Portable Runtime, a portability library written in C, Archiva Build Artifact Repository Manager, Apache Beam, an uber-API for big data Beehive Java visual object model. Bloodhound defect tracker based on Trac[3]. Calcite dynamic data management framework, Camel declarative routing and mediation rules engine which implements the Enterprise Integration Patterns using a Java-based domain specific language.
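For comparison, the training format documented for the OpenNLP name finder is one tokenized sentence per line, with each entity wrapped as `<START:type> ... <END>` and whitespace around the tags. The excerpt above shows neither, although the markup may simply have been lost when the post was rendered. The following is a minimal sketch for checking whether a training line carries any annotated span; the class `TrainingDataCheck`, the method `countSpans`, and the annotated sample sentence are hypothetical and are not taken from the post or from the OpenNLP API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrainingDataCheck {
    // Matches one annotated span such as "<START:technology> Camel <END>".
    // The ":type" part is optional; whitespace must separate tags from tokens.
    private static final Pattern SPAN =
        Pattern.compile("<START(?::([^>\\s]+))?>\\s+(.+?)\\s+<END>");

    /** Counts the annotated spans found on one line of training data. */
    public static int countSpans(String line) {
        Matcher m = SPAN.matcher(line);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] lines = {
            // Hypothetical annotated version of a sentence from the excerpt
            "<START:technology> Camel <END> is a declarative routing and mediation rules engine .",
            // The same kind of text with no markup, as shown in the excerpt
            "Camel declarative routing and mediation rules engine"
        };
        for (String line : lines) {
            System.out.println(countSpans(line) + " span(s): " + line);
        }
    }
}
```

A line that reports zero spans gives the trainer no example of the entity, so running a check like this over the whole file before training is a cheap way to rule out missing or malformed markup.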

Output when the first letter of the first word is capitalized:
Is Camel a Java software
[0..1) technology Is

Output when the first letter of the first word is not capitalized (no spans are printed):
camel is a Java software

What happens is that, whether or not the first word appears in the training file, the output is the first word of the sentence if and only if its first letter is capitalized.

I tried OpenNLP versions 1.6.0 and 1.7.2 to train the model.

Where could the issue be? Am I missing any rules?

Thanks in advance.

Source: https://stackoverflow.com/questions/44043876/open-nlp-ner-is-not-properly-trained
