Stanford Core NLP - understanding coreference resolution

终归单人心 2020-12-07 23:17

I'm having some trouble understanding the changes made to the coref resolver in the last version of the Stanford NLP tools. As an example, below is a sentence and the corre

3 Answers
  • 2020-12-08 00:05

    The first number is a cluster id (a cluster groups the mentions that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The pairs of numbers are the output of CorefChain#toString():

    public String toString(){
        return position.toString();
    }
    

    where position is a set of position pairs, one per mention of the entity (to get the mentions, use CorefChain.getCorefMentions()). Here is a complete example (in Groovy) that shows how to get from positions to tokens:

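    // Package names as in the dcoref module; adjust them to your CoreNLP version.
    import edu.stanford.nlp.dcoref.CorefChain
    import edu.stanford.nlp.dcoref.CorefChain.CorefMention
    import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
    import edu.stanford.nlp.pipeline.Annotation
    import edu.stanford.nlp.pipeline.StanfordCoreNLP

    class Example {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            props.put("dcoref.score", true);
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
            Annotation document = new Annotation(aText);
    
            pipeline.annotate(document);
            Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
    
            println aText
    
            for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
              CorefChain c = entry.getValue();
              println "ClusterId: " + entry.getKey();
              CorefMention cm = c.getRepresentativeMention();
              // Caveat: startIndex/endIndex are 1-based token positions within the
              // sentence, not character offsets, so subSequence() on the raw text
              // returns fragments of the text rather than the mention itself.
              println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);
    
              List<CorefMention> cms = c.getCorefMentions();
              println "Mentions:  ";
              cms.each { it ->
                  print aText.subSequence(it.startIndex, it.endIndex) + "|";
              }
            }
        }
    }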
    

    Output (I do not understand where 's' comes from):

    The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
    ClusterId: 1
    Representative Mention: he
    Mentions: he|atom |s|
    ClusterId: 6
    Representative Mention:  basic unit 
    Mentions:  basic unit |
    ClusterId: 8
    Representative Mention:  unit 
    Mentions:  unit |
    ClusterId: 10
    Representative Mention: it 
    Mentions: it |
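
    One likely explanation for the stray 's': the indices used above appear to be token positions rather than character offsets. The third answer's loop reads startIndex and endIndex as 1-based token positions within the sentence, with an exclusive end. The sketch below is plain Java with no CoreNLP dependency; the class SpanDemo, the hand-written token list, and the three spans are assumptions chosen to match the mentions "The atom", "a basic unit of matter" and "it", not values read from the library.

    ```java
    import java.util.Arrays;
    import java.util.List;

    class SpanDemo {
        // Misreads the span as character offsets, as subSequence(start, end) would.
        static String asCharOffsets(String text, int start, int end) {
            return text.substring(start, end);
        }

        // Reads the span as 1-based token positions with an exclusive end.
        static String asTokenSpan(List<String> tokens, int start, int end) {
            return String.join(" ", tokens.subList(start - 1, end - 1));
        }

        public static void main(String[] args) {
            // Shortened version of the example sentence (assumption for brevity).
            String text = "The atom is a basic unit of matter, it consists of a dense central nucleus.";
            // Hand-written tokenization (assumption): punctuation is its own token.
            List<String> tokens = Arrays.asList("The", "atom", "is", "a", "basic", "unit",
                    "of", "matter", ",", "it", "consists", "of", "a", "dense",
                    "central", "nucleus", ".");
            // Assumed {start, end} spans for the three mentions in cluster 1.
            int[][] spans = {{1, 3}, {4, 9}, {10, 11}};
            for (int[] s : spans) {
                System.out.println("[" + asCharOffsets(text, s[0], s[1]) + "] vs ["
                        + asTokenSpan(tokens, s[0], s[1]) + "]");
            }
        }
    }
    ```

    Read as character offsets, the three spans give "he", "atom " and "s", the same fragments as in the output above; read as token positions, they give the full mentions.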
    
  • 2020-12-08 00:13

    These are the recent results from the annotator.

    1. [1, 1] 1 The atom
    2. [1, 2] 1 a basic unit of matter
    3. [1, 3] 1 it
    4. [1, 6] 6 negatively charged electrons
    5. [1, 5] 5 a cloud of negatively charged electrons

    The markings are as follows:

    [Sentence number,'id']  Cluster_no  Text_Associated
    

    Text belonging to the same cluster refers to the same entity.
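
    As a rough illustration of this format, the sketch below parses lines of the form "[sentence, id] cluster text" and groups the mention texts by cluster number. The class name ClusterGrouping, the method groupByCluster, and the regular expression are assumptions made for this example; they are not part of CoreNLP, and the input lines are copied from the results listed above.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class ClusterGrouping {
        // Matches "[sentence, id] cluster text", e.g. "[1, 3] 1 it".
        private static final Pattern LINE =
                Pattern.compile("\\[(\\d+),\\s*(\\d+)\\]\\s+(\\d+)\\s+(.+)");

        // Groups mention texts by cluster number; lines that do not match are skipped.
        static Map<Integer, List<String>> groupByCluster(List<String> lines) {
            Map<Integer, List<String>> clusters = new TreeMap<>();
            for (String line : lines) {
                Matcher m = LINE.matcher(line.trim());
                if (!m.matches()) {
                    continue;
                }
                int cluster = Integer.parseInt(m.group(3));
                clusters.computeIfAbsent(cluster, k -> new ArrayList<>()).add(m.group(4));
            }
            return clusters;
        }

        public static void main(String[] args) {
            List<String> lines = Arrays.asList(
                    "[1, 1] 1 The atom",
                    "[1, 2] 1 a basic unit of matter",
                    "[1, 3] 1 it",
                    "[1, 6] 6 negatively charged electrons",
                    "[1, 5] 5 a cloud of negatively charged electrons");
            // Each entry maps a cluster number to the mentions that share it.
            groupByCluster(lines).forEach((id, texts) -> System.out.println(id + " -> " + texts));
        }
    }
    ```

    Here cluster 1 collects "The atom", "a basic unit of matter" and "it", while clusters 5 and 6 each hold a single mention.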

  • 2020-12-08 00:20

    I've been working with the coreference dependency graph, and I started from the other answer to this question. After a while, though, I realized that the algorithm above is not quite correct: its output is not even close to that of the modified version I have.

    For anyone else who finds this, here is the algorithm I ended up with. It also filters out self-references, because every representative mention also mentions itself and many mentions reference only themselves.

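    import java.util.List;
    import java.util.Map;

    import edu.stanford.nlp.dcoref.CorefChain;
    import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
    import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;

    // 'document' is the Annotation after pipeline.annotate(document) has run
    Map<Integer, CorefChain> coref = document.get(CorefChainAnnotation.class);
    
    for(Map.Entry<Integer, CorefChain> entry : coref.entrySet()) {
        CorefChain c = entry.getValue();
    
        //this is because it prints out a lot of self references which aren't that useful
        if(c.getCorefMentions().size() <= 1)
            continue;
    
        CorefMention cm = c.getRepresentativeMention();
        String clust = "";
        //sentNum and startIndex are 1-based and endIndex is exclusive, hence the -1 offsets
        List<CoreLabel> tks = document.get(SentencesAnnotation.class).get(cm.sentNum-1).get(TokensAnnotation.class);
        for(int i = cm.startIndex-1; i < cm.endIndex-1; i++)
            clust += tks.get(i).get(TextAnnotation.class) + " ";
        clust = clust.trim();
        System.out.println("representative mention: \"" + clust + "\" is mentioned by:");
    
        for(CorefMention m : c.getCorefMentions()){
            String clust2 = "";
            tks = document.get(SentencesAnnotation.class).get(m.sentNum-1).get(TokensAnnotation.class);
            for(int i = m.startIndex-1; i < m.endIndex-1; i++)
                clust2 += tks.get(i).get(TextAnnotation.class) + " ";
            clust2 = clust2.trim();
            //don't need the self mention (compared by text, so identical wording is skipped too)
            if(clust.equals(clust2))
                continue;
    
            System.out.println("\t" + clust2);
        }
    }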
    

    And the final output for your example sentence is the following:

    representative mention: "a basic unit of matter" is mentioned by:
    The atom
    it
    

    Usually "The atom" ends up being the representative mention, but surprisingly in this case it doesn't. Another example, with slightly more accurate output, is the following sentence:

    The Revolutionary War occurred during the 1700s and it was the first war in the United States.

    It produces the following output:

    representative mention: "The Revolutionary War" is mentioned by:
    it
    the first war in the United States
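
    The filtering logic can also be tried without CoreNLP. The sketch below uses a hypothetical stand-in class, Mention, in place of CorefMention, and a hand-written token list for the sentence above; both are assumptions for illustration, and only the index arithmetic and the two filters mirror the code in this answer.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    class ChainFilterSketch {
        // Hypothetical stand-in for CorefMention: 1-based sentence number,
        // 1-based start token, exclusive end token.
        static final class Mention {
            final int sentNum, startIndex, endIndex;
            Mention(int sentNum, int startIndex, int endIndex) {
                this.sentNum = sentNum;
                this.startIndex = startIndex;
                this.endIndex = endIndex;
            }
        }

        // Joins the tokens covered by a mention into a single string.
        static String textOf(List<List<String>> sentences, Mention m) {
            List<String> tokens = sentences.get(m.sentNum - 1);
            return String.join(" ", tokens.subList(m.startIndex - 1, m.endIndex - 1));
        }

        // Returns the mention texts that differ from the representative's text;
        // a chain with at most one mention yields an empty list.
        static List<String> otherMentions(List<List<String>> sentences,
                                          Mention representative, List<Mention> chain) {
            List<String> result = new ArrayList<>();
            if (chain.size() <= 1) {
                return result;
            }
            String repText = textOf(sentences, representative);
            for (Mention m : chain) {
                String text = textOf(sentences, m);
                if (!text.equals(repText)) {
                    result.add(text);
                }
            }
            return result;
        }

        public static void main(String[] args) {
            // Hand-written tokenization of the example sentence (assumption).
            List<List<String>> sentences = Arrays.asList(Arrays.asList(
                    "The", "Revolutionary", "War", "occurred", "during", "the", "1700s",
                    "and", "it", "was", "the", "first", "war", "in", "the", "United",
                    "States", "."));
            Mention rep = new Mention(1, 1, 4);
            List<Mention> chain = Arrays.asList(rep, new Mention(1, 9, 10), new Mention(1, 11, 18));
            System.out.println(otherMentions(sentences, rep, chain));
        }
    }
    ```

    With these assumed spans, the representative mention is "The Revolutionary War" and the remaining mentions are "it" and "the first war in the United States", matching the output above.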
    