I would like to group all named entities in a given document. For example,
**Barack Hussein Obama** II is the 44th and current President of the United States, a
If you want to avoid using NER, you could use a sentence chunker or parser, which will extract noun phrases generically. OpenNLP has both a sentence chunker and a parser, but if you are for some reason averse to using OpenNLP, you can try others. Since you may be interested in the OpenNLP chunker, I will post some code that extracts noun phrases with it.
Here is the code. You will need to download the models from SourceForge here:
http://opennlp.sourceforge.net/models-1.5/
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;
/**
 * Extracts noun phrases from a sentence. To split text into sentences with
 * OpenNLP, use the SentenceDetector classes.
 */
public class OpenNLPNounPhraseExtractor {

    static final int N = 2;

    public static void main(String[] args) {
        String modelPath = "c:\\temp\\opennlpmodels\\";
        // try-with-resources closes the model streams once the models are loaded
        try (InputStream tokenIn = new FileInputStream(new File(modelPath + "en-token.zip"));
                InputStream posIn = new FileInputStream(new File(modelPath + "en-pos-maxent.zip"));
                InputStream chunkerIn = new FileInputStream(new File(modelPath + "en-chunker.zip"))) {
            TokenizerME wordBreaker = new TokenizerME(new TokenizerModel(tokenIn));
            POSTaggerME posme = new POSTaggerME(new POSModel(posIn));
            ChunkerME chunkerME = new ChunkerME(new ChunkerModel(chunkerIn));

            // this is your sentence
            String sentence = "Barack Hussein Obama II is the 44th and current President of the United States, and the first African American to hold the office.";
            // words is the tokenized sentence
            String[] words = wordBreaker.tokenize(sentence);
            // posTags are the parts of speech of every word in the sentence (the chunker needs them as input)
            String[] posTags = posme.tag(words);
            // chunks are the start/end "spans" (indices into the words array) of each chunk
            Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
            // chunkStrings are the actual chunk texts
            String[] chunkStrings = Span.spansToStrings(chunks, words);
            for (int i = 0; i < chunks.length; i++) {
                // keep only noun phrases
                if (chunks[i].getType().equals("NP")) {
                    System.out.println("NP: \n\t" + chunkStrings[i]);
                    String[] split = chunkStrings[i].split(" ");
                    List<String> ngrams = ngram(Arrays.asList(split), N, " ");
                    System.out.println("ngrams:");
                    for (String gram : ngrams) {
                        System.out.println("\t" + gram);
                    }
                }
            }
        } catch (IOException e) {
            // report the failure (for example a missing model file) instead of silently ignoring it
            e.printStackTrace();
        }
    }

    public static List<String> ngram(List<String> input, int n, String separator) {
        // a phrase with n or fewer tokens is returned unchanged, as individual tokens
        if (input.size() <= n) {
            return input;
        }
        List<String> outGrams = new ArrayList<String>();
        // slide a window of n tokens over the input
        for (int i = 0; i + n <= input.size(); i++) {
            StringBuilder gram = new StringBuilder(input.get(i));
            for (int x = i + 1; x < i + n; x++) {
                gram.append(separator).append(input.get(x));
            }
            outGrams.add(gram.toString());
        }
        return outGrams;
    }
}
The output I get with your sentence is this (with N set to 2, i.e. bigrams):
NP:
    Barack Hussein Obama II
ngrams:
    Barack Hussein
    Hussein Obama
    Obama II
NP:
    the 44th and current President
ngrams:
    the 44th
    44th and
    and current
    current President
NP:
    the United States
ngrams:
    the United
    United States
NP:
    the first African American
ngrams:
    the first
    first African
    African American
NP:
    the office
ngrams:
    the
    office
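Note that the last phrase, "the office", comes out as two separate words rather than as one bigram. That is because ngram returns its input unchanged whenever the phrase has N or fewer tokens. If you would rather get the single n-gram in that case, a variant along the following lines might do. This is only a sketch: NGramSketch and ngrams are names made up for this illustration, it needs Java 8 or later for String.join, and joining a too-short phrase into one entry is my assumption about what you would want.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP
class NGramSketch {

    // Slides a window of n tokens over the input. A non-empty input shorter
    // than n is returned as one joined phrase instead of being dropped.
    static List<String> ngrams(List<String> tokens, int n, String separator) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> out = new ArrayList<>();
        if (tokens.isEmpty()) {
            return out;
        }
        if (tokens.size() < n) {
            out.add(String.join(separator, tokens));
            return out;
        }
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(separator, tokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams(Arrays.asList("the", "office"), 2, " "));
        System.out.println(ngrams(Arrays.asList("the", "United", "States"), 2, " "));
    }
}
```

With this variant "the office" yields the single bigram "the office", while longer phrases are windowed exactly as before.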
This does not explicitly handle the case where an adjective falls outside the NP. If that happens, you can get this information from the POS tags and integrate it. What I gave you should point you in the right direction.
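One way to integrate the POS tags is sketched below. It works on plain arrays rather than OpenNLP types so that it runs on its own: the NP is given by a start index and an exclusive end index into the words array, which is how I understand OpenNLP's Span to work. AdjectiveAttachSketch, extendLeft and phrase are hypothetical names, and the words, tags and chunk boundaries in main are hand-written for illustration, not real tagger or chunker output. The idea is simply to walk left from the start of the NP for as long as the preceding tag starts with JJ (the Penn Treebank adjective tags JJ, JJR and JJS).

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP
class AdjectiveAttachSketch {

    // Moves the start of an NP left over any directly preceding adjectives
    static int extendLeft(String[] posTags, int start) {
        while (start > 0 && posTags[start - 1].startsWith("JJ")) {
            start--;
        }
        return start;
    }

    // Joins words[start..end) with single spaces
    static String phrase(String[] words, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(words, start, end));
    }

    public static void main(String[] args) {
        // Hand-written example: pretend the chunker returned only "dogs" (indices 2..3) as the NP
        String[] words = {"saw", "large", "dogs", "today"};
        String[] posTags = {"VBD", "JJ", "NNS", "NN"};
        int start = 2;
        int end = 3;

        int newStart = extendLeft(posTags, start);
        System.out.println(phrase(words, newStart, end)); // large dogs
    }
}
```

In the extractor above you would call something like this with chunks[i].getStart() and chunks[i].getEnd() for each NP, then build the phrase from the words array instead of taking it from chunkStrings.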