Grouping all named entities in a document

执念已碎 2021-01-29 03:05

I would like to group all named entities in a given document. For example,

**Barack Hussein Obama** II  is the 44th and current President of the United States, a         


        
1 answer
  • 2021-01-29 03:33

    If you want to avoid using NER, you could use a sentence chunker or parser, which extracts noun phrases generically. OpenNLP has both a chunker and a parser, but if you are for some reason averse to using OpenNLP, you can try others. If you are interested in using the OpenNLP chunker, I will post some code that extracts noun phrases with it.

    Here is the code. You will need to download the models from SourceForge here:

    http://opennlp.sourceforge.net/models-1.5/

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import opennlp.tools.chunker.ChunkerME;
    import opennlp.tools.chunker.ChunkerModel;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.Span;
    
    /**
     *
     * Extracts noun phrases from a sentence. To create sentences using OpenNLP use
     * the SentenceDetector classes.
     */
    public class OpenNLPNounPhraseExtractor {
    
      // Size of the n-grams printed for each noun phrase (2 = bigrams)
      static final int N = 2;
    
      public static void main(String[] args) {
    
        try {
          // Folder holding the downloaded models; change this to your own path
          String modelPath = "c:\\temp\\opennlpmodels\\";
          // The models page serves these files with a .bin extension;
          // adjust the names below if you saved them differently
          TokenizerModel tm = new TokenizerModel(new FileInputStream(new File(modelPath + "en-token.bin")));
          TokenizerME wordBreaker = new TokenizerME(tm);
          POSModel pm = new POSModel(new FileInputStream(new File(modelPath + "en-pos-maxent.bin")));
          POSTaggerME posme = new POSTaggerME(pm);
          InputStream modelIn = new FileInputStream(modelPath + "en-chunker.bin");
          ChunkerModel chunkerModel = new ChunkerModel(modelIn);
          ChunkerME chunkerME = new ChunkerME(chunkerModel);
          //this is your sentence
          String sentence = "Barack Hussein Obama II  is the 44th and current President of the United States, and the first African American to hold the office.";
          //words is the tokenized sentence
          String[] words = wordBreaker.tokenize(sentence);
          //posTags are the parts of speech of every word in the sentence (The chunker needs this info of course)
          String[] posTags = posme.tag(words);
          //chunks are the start end "spans" indices to the chunks in the words array
          Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
          //chunkStrings are the actual chunks
          String[] chunkStrings = Span.spansToStrings(chunks, words);
          for (int i = 0; i < chunks.length; i++) {
            if (chunks[i].getType().equals("NP")) {
              System.out.println("NP: \n\t" + chunkStrings[i]);
              String[] split = chunkStrings[i].split(" ");
    
              List<String> ngrams = ngram(Arrays.asList(split), N, " ");
              System.out.println("ngrams:");
              for (String gram : ngrams) {
                System.out.println("\t" + gram);
              }
    
            }
          }
    
    
        } catch (IOException e) {
          // Report a missing or unreadable model file instead of failing silently
          e.printStackTrace();
        }
      }
    
      /**
       * Returns every run of n consecutive tokens, joined by the separator.
       * If the input has n or fewer tokens it is returned unchanged, so a
       * two-word phrase comes back as its individual words when n is 2.
       */
      public static List<String> ngram(List<String> input, int n, String separator) {
        if (input.size() <= n) {
          return input;
        }
        List<String> outGrams = new ArrayList<String>();
        // i is the index of the first token of each n-gram
        for (int i = 0; i + n <= input.size(); i++) {
          outGrams.add(String.join(separator, input.subList(i, i + n)));
        }
        return outGrams;
      }
    }
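
One detail of `ngram` above is worth knowing: it returns its input unchanged when the phrase has N or fewer tokens, so a two-word noun phrase is reported as two single words when N is 2. If you would rather get the whole phrase back as one bigram, a stricter variant could look like the sketch below. The class name `StrictNgrams` and the method name `ngramStrict` are made up for this illustration, and the sketch assumes `n` is at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StrictNgrams {

  // Hypothetical variant of ngram(): emits only runs of exactly n tokens,
  // and returns an empty list when the input has fewer than n tokens.
  public static List<String> ngramStrict(List<String> input, int n, String separator) {
    List<String> outGrams = new ArrayList<>();
    // i is the index of the first token of each n-gram
    for (int i = 0; i + n <= input.size(); i++) {
      outGrams.add(String.join(separator, input.subList(i, i + n)));
    }
    return outGrams;
  }

  public static void main(String[] args) {
    // prints [the office]
    System.out.println(ngramStrict(Arrays.asList("the", "office"), 2, " "));
    // prints [Barack Hussein, Hussein Obama, Obama II]
    System.out.println(ngramStrict(Arrays.asList("Barack", "Hussein", "Obama", "II"), 2, " "));
    // prints []
    System.out.println(ngramStrict(Arrays.asList("office"), 2, " "));
  }
}
```

Whether a short phrase should yield nothing, itself, or its individual words depends on what you do with the n-grams afterwards, so treat this as one option rather than a fix.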
    

    The output I get with your sentence is this (with N set to 2, i.e. bigrams):

    NP: 
        Barack Hussein Obama II
    ngrams:
        Barack Hussein
        Hussein Obama
        Obama II
    NP: 
        the 44th and current President
    ngrams:
        the 44th
        44th and
        and current
        current President
    NP: 
        the United States
    ngrams:
        the United
        United States
    NP: 
        the first African American
    ngrams:
        the first
        first African
        African American
    NP: 
        the office
    ngrams:
        the
        office
    

    This does not explicitly handle the case where an adjective falls outside of the NP. If that happens, you can get this information from the POS tags and integrate it. What I gave you should point you in the right direction.
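
As a rough sketch of that last idea, one option is to widen a chunk to the left for as long as the token just before it is tagged as an adjective. The code below is plain Java with no OpenNLP dependency so it can be run on its own. The class and method names (`AdjectiveExtender`, `extendStart`, `phrase`) are hypothetical, the tokens, tags and chunk bounds are hand-written for illustration rather than real chunker output, and it assumes Penn Treebank tags, where adjective tags start with `JJ`.

```java
import java.util.Arrays;

public class AdjectiveExtender {

  // Hypothetical helper: moves a chunk's start index left while the token
  // just before it carries an adjective tag (Penn Treebank JJ, JJR, JJS).
  public static int extendStart(String[] posTags, int start) {
    while (start > 0 && posTags[start - 1].startsWith("JJ")) {
      start--;
    }
    return start;
  }

  // Joins words[start, end) with single spaces; end is exclusive.
  public static String phrase(String[] words, int start, int end) {
    return String.join(" ", Arrays.copyOfRange(words, start, end));
  }

  public static void main(String[] args) {
    // Hand-written tokens, tags and chunk bounds for illustration only;
    // they are not actual chunker output.
    String[] words = {"current", "President", "Obama"};
    String[] posTags = {"JJ", "NNP", "NNP"};
    int start = 1; // assumed NP chunk covering "President Obama"
    int end = 3;

    int extended = extendStart(posTags, start);
    // prints: current President Obama
    System.out.println(phrase(words, extended, end));
  }
}
```

In the extractor above, `start` and `end` would come from `chunks[i].getStart()` and `chunks[i].getEnd()` (the end index of a `Span` is exclusive), and `posTags` is the array returned by the POS tagger.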
