Extract list of Persons and Organizations using Stanford NER Tagger in NLTK

孤独总比滥情好 2020-11-29 01:36

I am trying to extract a list of persons and organizations using the Stanford Named Entity Recognizer (NER) in Python NLTK. When I run:

from nltk.tag.stanford impo         


        
6 Answers
  • 2020-11-29 02:06

    IOB/BIO stands for Inside, Outside, Beginning (IOB), also written Beginning, Inside, Outside (BIO).

    The Stanford NE tagger returns one tag per token in a similar style, with "O" for tokens outside any entity but without the B-/I- prefixes, e.g.

    [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
    ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
    ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
    

    Here ('Rami', 'PERSON') and ('Eid', 'PERSON') are both tagged PERSON: "Rami" is the beginning of an NE chunk and "Eid" is inside it. Any non-NE token is tagged "O".

    The idea for extracting contiguous NE chunks is very similar to "Named Entity Recognition with Regular Expression: NLTK", but because the Stanford NE chunker API doesn't return a nice tree to parse, you have to do this:

    def get_continuous_chunks(tagged_sent):
        continuous_chunk = []
        current_chunk = []
    
        for token, tag in tagged_sent:
            if tag != "O":
                current_chunk.append((token, tag))
            else:
                if current_chunk: # if the current chunk is not empty
                    continuous_chunk.append(current_chunk)
                    current_chunk = []
        # Flush the final current_chunk into the continuous_chunk, if any.
        if current_chunk:
            continuous_chunk.append(current_chunk)
        return continuous_chunk
    
    ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
    
    named_entities = get_continuous_chunks(ne_tagged_sent)
    named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
    named_entities_str_tag = [(" ".join([token for token, tag in ne]), ne[0][1]) for ne in named_entities]
    
    print(named_entities)
    print()
    print(named_entities_str)
    print()
    print(named_entities_str_tag)
    print()
    

    [out]:

    [[('Rami', 'PERSON'), ('Eid', 'PERSON')], [('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION')], [('NY', 'LOCATION')]]
    
    ['Rami Eid', 'Stony Brook University', 'NY']
    
    [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
    

    Note the limitation: if two NEs are directly adjacent, the result might be wrong. That said, I still can't think of an example where two NEs are adjacent without any "O" between them.
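    One case where the limitation would show up is a PERSON token followed directly by an ORGANIZATION token: get_continuous_chunks only checks for "O", so it would merge both into one chunk. Below is a minimal sketch of a variant that also closes the chunk when the tag changes. The name get_typed_chunks and the sample sentence are hypothetical; the sample is hand-tagged for illustration and is not real tagger output.

```python
def get_typed_chunks(tagged_sent):
    """Group adjacent non-"O" tokens, starting a new chunk when the tag changes."""
    chunks = []
    current_chunk = []
    for token, tag in tagged_sent:
        # Close the open chunk on "O" or when the entity type switches.
        if current_chunk and (tag == "O" or tag != current_chunk[-1][1]):
            chunks.append(current_chunk)
            current_chunk = []
        if tag != "O":
            current_chunk.append((token, tag))
    # Flush the last chunk, if any.
    if current_chunk:
        chunks.append(current_chunk)
    return chunks


# Hand-tagged illustration (an assumption, not tagger output):
# a PERSON directly followed by an ORGANIZATION.
sample = [('Ask', 'O'), ('Rami', 'PERSON'), ('Eid', 'PERSON'),
          ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('today', 'O')]

typed_chunks = get_typed_chunks(sample)
print([(" ".join(tok for tok, _ in chunk), chunk[0][1]) for chunk in typed_chunks])
# [('Rami Eid', 'PERSON'), ('Stony Brook', 'ORGANIZATION')]
```

    Two adjacent entities of the same type would still be merged, since the plain tags carry no boundary information.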


    As @alexis suggested, it's better to convert the Stanford NE output into NLTK trees:

    from nltk import pos_tag
    from nltk.chunk import conlltags2tree
    from nltk.tree import Tree
    
    def stanfordNE2BIO(tagged_sent):
        bio_tagged_sent = []
        prev_tag = "O"
        for token, tag in tagged_sent:
            if tag == "O": #O
                bio_tagged_sent.append((token, tag))
                prev_tag = tag
                continue
            if tag != "O" and prev_tag == "O": # Begin NE
                bio_tagged_sent.append((token, "B-"+tag))
                prev_tag = tag
            elif prev_tag != "O" and prev_tag == tag: # Inside NE
                bio_tagged_sent.append((token, "I-"+tag))
                prev_tag = tag
            elif prev_tag != "O" and prev_tag != tag: # Adjacent NE
                bio_tagged_sent.append((token, "B-"+tag))
                prev_tag = tag
    
        return bio_tagged_sent
    
    
    def stanfordNE2tree(ne_tagged_sent):
        bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
        sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
        sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]
    
        sent_conlltags = [(token, pos, ne) for token, pos, ne in zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
        ne_tree = conlltags2tree(sent_conlltags)
        return ne_tree
    
    ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), 
    ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), 
    ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), 
    ('in', 'O'), ('NY', 'LOCATION')]
    
    ne_tree = stanfordNE2tree(ne_tagged_sent)
    
    print(ne_tree)
    

    [out]:

      (S
      (PERSON Rami/NNP Eid/NNP)
      is/VBZ
      studying/VBG
      at/IN
      (ORGANIZATION Stony/NNP Brook/NNP University/NNP)
      in/IN
      (LOCATION NY/NNP))
    

    Then:

    ne_in_sent = []
    for subtree in ne_tree:
        if isinstance(subtree, Tree): # If subtree is a named-entity chunk, i.e. NE != "O"
            ne_label = subtree.label()
            ne_string = " ".join([token for token, pos in subtree.leaves()])
            ne_in_sent.append((ne_string, ne_label))
    print(ne_in_sent)
    

    [out]:

    [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
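    Since the question asks only for persons and organizations, the (string, label) pairs can then be filtered and grouped by label. This is a small sketch; group_entities_by_label and its wanted parameter are hypothetical names, and the input list is copied from the output above.

```python
from collections import defaultdict


def group_entities_by_label(entities, wanted=("PERSON", "ORGANIZATION")):
    """Collect entity strings per label, keeping only the wanted labels."""
    grouped = defaultdict(list)
    for text, label in entities:
        if label in wanted:
            grouped[label].append(text)
    return dict(grouped)


# Same pairs as in the output shown above.
ne_in_sent = [('Rami Eid', 'PERSON'),
              ('Stony Brook University', 'ORGANIZATION'),
              ('NY', 'LOCATION')]

print(group_entities_by_label(ne_in_sent))
# {'PERSON': ['Rami Eid'], 'ORGANIZATION': ['Stony Brook University']}
```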
    
  • 2020-11-29 02:07

    WARNING: Even if you get the model "all.3class.distsim.crf.ser.gz", please don't use it, because:

      1st reason:

    The Stanford NLP people have openly apologized for this model's bad accuracy.

      2nd reason:

    Its accuracy is bad because it is case sensitive.

      SOLUTION

    Use the model called "english.all.3class.caseless.distsim.crf.ser.gz".

  • 2020-11-29 02:12

    Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012) does not chunk named entities. From the accepted answer:

    Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012)

    You have the following options:

    1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be taken together as one named entity. That's very easy, but of course it will sometimes combine different named entities. (E.g. New York, Boston [and] Baltimore is about three cities, not one.) Edit: This is what Alvas's code does in the accepted answer. See below for a simpler implementation.

    2. Use nltk.ne_chunk(). It doesn't use the Stanford recognizer but it does chunk entities. (It's a wrapper around an IOB named entity tagger).

    3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.

    4. Train your own IOB named entity chunker (using the Stanford tools, or the NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.

    Edit: If all you want is to pull out runs of continuous named entities (option 1 above), you should use itertools.groupby:

    from itertools import groupby
    for tag, chunk in groupby(netagged_words, lambda x:x[1]):
        if tag != "O":
            print("%-12s"%tag, " ".join(w for w, t in chunk))
    

    If netagged_words is the list of (word, type) tuples in your question, this produces:

    PERSON       Rami Eid
    ORGANIZATION Stony Brook University
    LOCATION     NY
    

    Note again that if two named entities of the same type occur right next to each other, this approach will combine them. E.g. New York, Boston [and] Baltimore is about three cities, not one.
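    To make the merging behavior concrete, here is a small sketch that wraps the groupby loop in a function returning a list. chunk_by_tag is a hypothetical name, and the sample is hand-tagged for illustration (an assumption, not tagger output), with the comma left out so that two city names sit next to each other.

```python
from itertools import groupby


def chunk_by_tag(netagged_words):
    """Join each run of identically tagged, non-"O" words into one string."""
    return [(tag, " ".join(word for word, _ in chunk))
            for tag, chunk in groupby(netagged_words, key=lambda pair: pair[1])
            if tag != "O"]


# Hand-tagged illustration: "New York" and "Boston" are adjacent, both LOCATION.
sample = [('New', 'LOCATION'), ('York', 'LOCATION'), ('Boston', 'LOCATION'),
          ('and', 'O'), ('Baltimore', 'LOCATION')]

print(chunk_by_tag(sample))
# [('LOCATION', 'New York Boston'), ('LOCATION', 'Baltimore')]
```

    "New York" and "Boston" come back as a single LOCATION, while "Baltimore" is separated only because of the intervening "O" token.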

  • 2020-11-29 02:12

    This is not exactly what the original poster asked for, but maybe it can be of some help:

    listx = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
    ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
    ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
    
    
    def parser(n, string):
        for i in listx[n]:
            if i == string:
                pass
            else:
                return i
    
    name = parser(0,'PERSON')
    lname = parser(1,'PERSON')
    org1 = parser(5,'ORGANIZATION')
    org2 = parser(6,'ORGANIZATION')
    org3 = parser(7,'ORGANIZATION')
    
    
    print(name, lname)
    print(org1, org2, org3)
    

    The output would be something like this:

    Rami Eid
    Stony Brook University
    
  • 2020-11-29 02:16

    Try using the built-in enumerate() function.

    Once you have applied NER to the list of words and have the (word, type) tuples, iterate over that list with enumerate(list). This assigns an index to every tuple in the list.

    Later, when you extract PERSON/ORGANIZATION/LOCATION entries from the list, each one carries its index:

    1   Hussein
    2   Obama
    3   II
    6   James
    7   Naismith
    21   Naismith
    19   Tony
    20   Hinkle
    0   Frank
    1   Mahan
    14   Naismith
    0   Naismith
    0   Mahan
    0   Mahan
    0   Naismith
    

    Tokens whose indices are consecutive can now be joined into a single name:

    Hussein Obama II, James Naismith, Tony Hinkle, Frank Mahan
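    The answer gives no code, so here is a minimal sketch of the idea under stated assumptions: merge_consecutive and its wanted parameter are hypothetical names, and the sample list is hand-tagged for illustration rather than taken from the tagger.

```python
def merge_consecutive(tagged_words, wanted="PERSON"):
    """Join words tagged `wanted` whose enumerate() indices are consecutive."""
    names = []
    current = []
    last_index = None
    for index, (word, tag) in enumerate(tagged_words):
        if tag != wanted:
            continue
        if last_index is not None and index == last_index + 1:
            # Index follows the previous match directly: same name continues.
            current.append(word)
        else:
            # Gap in the indices: finish the previous name and start a new one.
            if current:
                names.append(" ".join(current))
            current = [word]
        last_index = index
    if current:
        names.append(" ".join(current))
    return names


# Hand-tagged illustration (an assumption, not tagger output).
sample = [('Frank', 'PERSON'), ('Mahan', 'PERSON'), ('met', 'O'),
          ('James', 'PERSON'), ('Naismith', 'PERSON')]

print(merge_consecutive(sample))
# ['Frank Mahan', 'James Naismith']
```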

  • 2020-11-29 02:19

    Use the pycorenlp wrapper from Python, then use 'entitymentions' as a key to get each contiguous person or organization chunk as a single string.
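    A rough sketch of what this could look like, with several assumptions: a CoreNLP server is already running at http://localhost:9000, its version emits 'entitymentions' as part of the ner annotator (older releases may need the entitymentions annotator listed explicitly), and each mention carries 'text' and 'ner' keys. The helper names are hypothetical, and sample_annotation is a hand-written stand-in for the server response, not real output.

```python
def extract_entity_mentions(annotation, wanted=("PERSON", "ORGANIZATION")):
    """Pull (text, label) pairs out of a CoreNLP-style JSON annotation dict."""
    mentions = []
    for sentence in annotation.get("sentences", []):
        for mention in sentence.get("entitymentions", []):
            if mention["ner"] in wanted:
                mentions.append((mention["text"], mention["ner"]))
    return mentions


def annotate_with_corenlp(text, url="http://localhost:9000"):
    """Needs pycorenlp installed and a running CoreNLP server (URL is an assumption)."""
    from pycorenlp import StanfordCoreNLP
    nlp = StanfordCoreNLP(url)
    return nlp.annotate(text, properties={
        "annotators": "tokenize,ssplit,pos,lemma,ner",
        "outputFormat": "json",
    })


# Hand-written stand-in for a server response, reusing the tags from this thread.
sample_annotation = {"sentences": [{"entitymentions": [
    {"text": "Rami Eid", "ner": "PERSON"},
    {"text": "Stony Brook University", "ner": "ORGANIZATION"},
    {"text": "NY", "ner": "LOCATION"},
]}]}

print(extract_entity_mentions(sample_annotation))
# [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION')]
```

    With a server running, extract_entity_mentions(annotate_with_corenlp(text)) would chain the two steps.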
