I am working with Stanford CoreNLP and using it for NER. But when I extract organization names, I see that each word is tagged with the annotation. So, if the entity is "NE
If you just want the complete strings of each named entity found by Stanford NER, try this:
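import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.Triple;
import java.util.List;
String text = "<INSERT YOUR INPUT TEXT HERE>";
AbstractSequenceClassifier<CoreMap> ner = CRFClassifier.getDefaultClassifier();
List<Triple<String, Integer, Integer>> entities = ner.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> entity : entities)
    System.out.println(text.substring(entity.second, entity.third));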
In case you're wondering, the entity class is indicated by entity.first.
Alternatively, you can use ner.classifyWithInlineXML(text) to get output that looks like <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .
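If you go the inline XML route, you still need to pull the entities back out of the tagged string. Here is a minimal sketch of one way to do that with a regular expression. The class name InlineXmlEntities and its extract method are made up for illustration and are not part of Stanford NER, and the sketch assumes the input text contains no literal angle brackets of its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Stanford NER.
class InlineXmlEntities {
    // Matches <LABEL>text</LABEL>; the backreference \1 forces the closing
    // tag to carry the same label as the opening tag.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z_]+)>(.*?)</\\1>");

    /** Returns "LABEL: text" strings in order of appearance. */
    static List<String> extract(String inlineXml) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(m.group(1) + ": " + m.group(2));
        }
        return entities;
    }

    public static void main(String[] args) {
        // Sample string copied from the output format shown above.
        String tagged = "<PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .";
        for (String entity : extract(tagged)) {
            System.out.println(entity);
        }
    }
}
```

For the sample string this prints PERSON: Bill Smith and LOCATION: Paris, one per line.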
No, CoreNLP 3.5.0 has no utility to merge the NER labels. The next release (coming sometime next week) has a new MentionsAnnotator which handles this merging for you. For now, you can (a) use the MentionsAnnotator, available on the CoreNLP master branch, or (b) merge manually.
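For option (b), a rough sketch of a manual merge might look like the following. It assumes you have already pulled the token words and their NER labels out of the pipeline into two parallel lists; the NerMerger class and its merge method are hypothetical names, not CoreNLP API. Adjacent tokens with the same non-"O" label are joined into one mention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
class NerMerger {
    /** Joins adjacent tokens sharing the same non-"O" label into "LABEL: phrase" strings. */
    static List<String> merge(List<String> words, List<String> labels) {
        List<String> mentions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentLabel = "O";
        for (int i = 0; i < words.size(); i++) {
            String label = labels.get(i);
            if (!label.equals(currentLabel)) {
                // Label changed: flush the mention collected so far, if any.
                if (!currentLabel.equals("O")) {
                    mentions.add(currentLabel + ": " + current);
                }
                current.setLength(0);
                currentLabel = label;
            }
            if (!label.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        // Flush a mention that runs to the end of the input.
        if (!currentLabel.equals("O")) {
            mentions.add(currentLabel + ": " + current);
        }
        return mentions;
    }

    public static void main(String[] args) {
        List<String> words = List.of("The", "New", "York", "Times", "hired", "Bill", "Smith");
        List<String> labels = List.of("O", "ORGANIZATION", "ORGANIZATION", "ORGANIZATION", "O", "PERSON", "PERSON");
        merge(words, labels).forEach(System.out::println);
    }
}
```

One limitation to keep in mind: because the labels carry no begin/inside markers, two distinct entities of the same type that sit directly next to each other will be merged into a single mention.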
Use the -outputFormat xml option to have CoreNLP output XML. (Is this what you want?)
You can set any property in the properties file, including the "outputFormat" property. Stanford CoreNLP supports several output formats, such as json, xml, and text. However, the xml option is not an inline XML format: it gives per-token annotations for NER, so a multi-word entity still appears as several separately tagged tokens.
<tokens>
<token id="1">
<word>New</word>
<lemma>New</lemma>
<CharacterOffsetBegin>0</CharacterOffsetBegin>
<CharacterOffsetEnd>3</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>ORGANIZATION</NER>
<Speaker>PER0</Speaker>
</token>
<token id="2">
<word>York</word>
<lemma>York</lemma>
<CharacterOffsetBegin>4</CharacterOffsetBegin>
<CharacterOffsetEnd>8</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>ORGANIZATION</NER>
<Speaker>PER0</Speaker>
</token>
<token id="3">
<word>Times</word>
<lemma>Times</lemma>
<CharacterOffsetBegin>9</CharacterOffsetBegin>
<CharacterOffsetEnd>14</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>ORGANIZATION</NER>
<Speaker>PER0</Speaker>
</token>
</tokens>
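If you are stuck with this per-token XML, one option is to read the word and NER values back out of it and then do the merge yourself. The sketch below uses only the JDK's built-in DOM parser. The TokenXmlReader class is a hypothetical name, and it assumes the input is shaped like the fragment above, with one word and one NER element per token.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of CoreNLP.
class TokenXmlReader {
    /** Returns one "word/NER" string per <token> element, in document order. */
    static List<String> read(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            NodeList tokens = root.getElementsByTagName("token");
            List<String> pairs = new ArrayList<>();
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                // Assumes every token has exactly one <word> and one <NER> child.
                String word = token.getElementsByTagName("word").item(0).getTextContent();
                String ner = token.getElementsByTagName("NER").item(0).getTextContent();
                pairs.add(word + "/" + ner);
            }
            return pairs;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse token XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tokens>"
                + "<token id=\"1\"><word>New</word><NER>ORGANIZATION</NER></token>"
                + "<token id=\"2\"><word>York</word><NER>ORGANIZATION</NER></token>"
                + "<token id=\"3\"><word>Times</word><NER>ORGANIZATION</NER></token>"
                + "</tokens>";
        read(xml).forEach(System.out::println);
    }
}
```

The resulting word/label pairs are the input a manual merge step needs; the CharacterOffsetBegin and CharacterOffsetEnd elements could be read the same way if you want to slice the entity out of the original text instead.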
From Stanford CoreNLP 3.6 onwards, you can add entitymentions to the pipeline and get a list of all entities, with multi-word entities already merged. Here is an example that works for me.
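import edu.stanford.nlp.ling.CoreAnnotations.MentionsAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.List;
import java.util.Properties;

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, entitymentions");
// jg-regexner.txt is my own RegexNER mapping file; point this at yours.
props.setProperty("regexner.mapping", "jg-regexner.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String inputText = "I have done Bachelor of Arts and Bachelor of Laws so that I can work at British Broadcasting Corporation";
Annotation annotation = new Annotation(inputText);
pipeline.annotate(annotation);
// Each CoreMap here is one complete (possibly multi-word) entity mention.
List<CoreMap> multiWordsExp = annotation.get(MentionsAnnotation.class);
for (CoreMap multiWord : multiWordsExp) {
    String custNERClass = multiWord.get(NamedEntityTagAnnotation.class);
    System.out.println(multiWord + " : " + custNERClass);
}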