Formatting NER output from Stanford CoreNLP

耶瑟儿~ 2021-01-15 16:22

I am working with Stanford CoreNLP and using it for NER. But when I extract organization names, I see that each word is tagged with the annotation. So, if the entity is "NE

4 Answers
  • 2021-01-15 17:08

    If you just want the complete strings of each named entity found by Stanford NER, try this:

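    import java.util.List;
    import edu.stanford.nlp.ie.AbstractSequenceClassifier;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.util.CoreMap;
    import edu.stanford.nlp.util.Triple;
    String text = "<INSERT YOUR INPUT TEXT HERE>";
    AbstractSequenceClassifier<CoreMap> ner = CRFClassifier.getDefaultClassifier();
    List<Triple<String, Integer, Integer>> entities = ner.classifyToCharacterOffsets(text);
    for (Triple<String, Integer, Integer> entity : entities)
        System.out.println(text.substring(entity.second, entity.third));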
    

    In case you're wondering, the entity class is indicated by entity.first.

    Alternatively, you can use ner.classifyWithInlineXML(text) to get output that looks like <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .
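
If you go the inline-XML route, the tagged spans still have to be pulled out of the string. A minimal sketch, assuming the output has the flat, non-nested `<TYPE>text</TYPE>` shape shown above; the class name `InlineXmlEntities` and the method `extract` are hypothetical helpers, not part of CoreNLP, and the sample string is copied from the answer rather than produced by running the classifier:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineXmlEntities {
    // Matches <TAG>text</TAG>; the backreference \1 requires the closing tag
    // to repeat the opening tag name.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z_]+)>(.*?)</\\1>");

    /** Returns "TYPE: text" for every tagged span, in order of appearance. */
    static List<String> extract(String inlineXml) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(m.group(1) + ": " + m.group(2));
        }
        return entities;
    }

    public static void main(String[] args) {
        // Hypothetical input: the sample string quoted in the answer.
        String tagged = "<PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .";
        for (String entity : extract(tagged)) {
            System.out.println(entity);
        }
    }
}
```

A regex is enough here only because the tags are flat; if the text itself may contain angle brackets, classifyToCharacterOffsets is the safer choice.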

  • 2021-01-15 17:19

    No, CoreNLP 3.5.0 has no utility to merge the NER labels. The next release (coming sometime next week) has a new MentionsAnnotator which handles this merging for you. For now, you can (a) use the MentionsAnnotator, available on the CoreNLP master branch, or (b) merge manually.

    Use the -outputFormat xml option to have CoreNLP output XML. (Is this what you want?)
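
For option (b), merging manually can be as simple as joining runs of adjacent tokens that carry the same label. A rough sketch, assuming the words and NER labels have already been copied out of the pipeline into two parallel lists (for example from each token's word() and ner() values) and that "O" marks tokens outside any entity; `NerMerger` and `merge` are hypothetical names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NerMerger {
    /** Merges runs of adjacent tokens sharing a non-"O" tag into "TYPE: phrase" strings. */
    static List<String> merge(List<String> words, List<String> tags) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // Tag changed: flush the entity collected so far, if any.
                if (!currentTag.equals("O")) {
                    entities.add(currentTag + ": " + current);
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) current.append(' ');
                current.append(words.get(i));
            }
        }
        // Flush an entity that runs to the end of the input.
        if (!currentTag.equals("O")) {
            entities.add(currentTag + ": " + current);
        }
        return entities;
    }

    public static void main(String[] args) {
        // Hypothetical tokens and labels, written by hand for illustration.
        List<String> words = Arrays.asList("The", "New", "York", "Times", "hired", "Bill", "Smith");
        List<String> tags = Arrays.asList("O", "ORGANIZATION", "ORGANIZATION", "ORGANIZATION", "O", "PERSON", "PERSON");
        merge(words, tags).forEach(System.out::println);
    }
}
```

One known limitation of this approach: two distinct entities of the same type that sit directly next to each other are merged into one, because plain per-token labels carry no boundary marker.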

  • 2021-01-15 17:21

    You can set any property in the properties file, including the "outputFormat" property. Stanford CoreNLP supports several output formats, such as json, xml, and text. However, the xml option is not an inline XML format: it gives per-token annotations for NER.

        <tokens> 
          <token id="1"> 
            <word>New</word> 
            <lemma>New</lemma> 
            <CharacterOffsetBegin>0</CharacterOffsetBegin> 
            <CharacterOffsetEnd>3</CharacterOffsetEnd> 
            <POS>NNP</POS> 
            <NER>ORGANIZATION</NER> 
            <Speaker>PER0</Speaker> 
          </token> 
          <token id="2"> 
            <word>York</word> 
            <lemma>York</lemma> 
            <CharacterOffsetBegin>4</CharacterOffsetBegin> 
            <CharacterOffsetEnd>8</CharacterOffsetEnd> 
            <POS>NNP</POS> 
            <NER>ORGANIZATION</NER> 
            <Speaker>PER0</Speaker> 
          </token> 
          <token id="3"> 
            <word>Times</word> 
            <lemma>Times</lemma> 
            <CharacterOffsetBegin>9</CharacterOffsetBegin> 
            <CharacterOffsetEnd>14</CharacterOffsetEnd> 
            <POS>NNP</POS> 
            <NER>ORGANIZATION</NER> 
            <Speaker>PER0</Speaker> 
          </token> 
        </tokens> 
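
Because the XML carries character offsets for every token, whole-entity spans can be recovered from it after the fact. The following is only a sketch using the JDK's DOM parser, assuming every `<token>` has `NER`, `CharacterOffsetBegin` and `CharacterOffsetEnd` children as in the fragment above and that non-entity tokens are labelled "O"; `XmlTokenSpans` and `spans` are hypothetical names, and the sample XML is a trimmed copy of that fragment:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class XmlTokenSpans {
    /** Returns "TYPE [begin,end)" for each run of consecutive tokens with the same non-"O" NER value. */
    static List<String> spans(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList tokens = doc.getElementsByTagName("token");
        List<String> out = new ArrayList<>();
        String type = "O";
        int begin = -1;
        int end = -1;
        for (int i = 0; i < tokens.getLength(); i++) {
            Element token = (Element) tokens.item(i);
            String ner = child(token, "NER");
            if (!ner.equals(type)) {
                // Label changed: close the open span, then start a new one here.
                if (!type.equals("O")) out.add(type + " [" + begin + "," + end + ")");
                type = ner;
                begin = Integer.parseInt(child(token, "CharacterOffsetBegin"));
            }
            end = Integer.parseInt(child(token, "CharacterOffsetEnd"));
        }
        if (!type.equals("O")) out.add(type + " [" + begin + "," + end + ")");
        return out;
    }

    // Text of the first child element with the given name (assumed to exist).
    private static String child(Element parent, String name) {
        return parent.getElementsByTagName(name).item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tokens>"
                + "<token id=\"1\"><word>New</word><CharacterOffsetBegin>0</CharacterOffsetBegin>"
                + "<CharacterOffsetEnd>3</CharacterOffsetEnd><NER>ORGANIZATION</NER></token>"
                + "<token id=\"2\"><word>York</word><CharacterOffsetBegin>4</CharacterOffsetBegin>"
                + "<CharacterOffsetEnd>8</CharacterOffsetEnd><NER>ORGANIZATION</NER></token>"
                + "<token id=\"3\"><word>Times</word><CharacterOffsetBegin>9</CharacterOffsetBegin>"
                + "<CharacterOffsetEnd>14</CharacterOffsetEnd><NER>ORGANIZATION</NER></token>"
                + "</tokens>";
        spans(xml).forEach(System.out::println); // prints "ORGANIZATION [0,14)"
    }
}
```

With the original input text at hand, `text.substring(begin, end)` on such a span gives back the complete entity string.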
    
  • 2021-01-15 17:22

    From Stanford CoreNLP 3.6 onwards, you can add the entitymentions annotator to the pipeline and get a list of all entity mentions. Here is a working example:

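    import java.util.List;
    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations.MentionsAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    // entitymentions runs last and groups the NER-tagged tokens into mentions.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, entitymentions");
    props.setProperty("regexner.mapping", "jg-regexner.txt");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    String inputText = "I have done Bachelor of Arts and Bachelor of Laws so that I can work at British Broadcasting Corporation";
    Annotation annotation = new Annotation(inputText);
    pipeline.annotate(annotation);

    // Each mention is a CoreMap covering one complete, possibly multi-word, entity.
    List<CoreMap> multiWordsExp = annotation.get(MentionsAnnotation.class);
    for (CoreMap multiWord : multiWordsExp) {
        String custNERClass = multiWord.get(NamedEntityTagAnnotation.class);
        System.out.println(multiWord + " : " + custNERClass);
    }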
    