NLTK Named Entity recognition to a Python list

再見小時候 2020-11-28 08:14

I used NLTK's ne_chunk to extract named entities from a text:

my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police of


        
7 Answers
  • 2020-11-28 08:40

    You can also extract the label of each named entity in the text with this code:

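    import nltk

    # `sentence` holds the text to analyze (my_sent in the question)
    for sent in nltk.sent_tokenize(sentence):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            # named-entity chunks are Tree objects with a label; plain tokens are (word, tag) tuples
            if hasattr(chunk, 'label'):
                print(chunk.label(), ' '.join(c[0] for c in chunk))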
    

    Output:

    GPE WASHINGTON
    GPE New York
    PERSON Loretta E. Lynch
    GPE Brooklyn
    

    Washington, New York and Brooklyn are labeled GPE (geo-political entity), and Loretta E. Lynch is labeled PERSON.
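
    If the labels matter later on, the (label, text) pairs can be collected into a dictionary instead of being printed. The following is only a sketch: the helper name `group_by_label` is made up for illustration, and the `pairs` list is hard-coded from the output shown above so that the snippet runs without NLTK. In practice you would append `(chunk.label(), text)` inside the loop.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (label, text) pairs into {label: [text, ...]}, keeping first-seen order."""
    grouped = defaultdict(list)
    for label, text in pairs:
        if text not in grouped[label]:  # skip repeated mentions of the same entity
            grouped[label].append(text)
    return dict(grouped)


# Pairs copied from the printed output above (assumed input for this sketch).
pairs = [
    ("GPE", "WASHINGTON"),
    ("GPE", "New York"),
    ("PERSON", "Loretta E. Lynch"),
    ("GPE", "Brooklyn"),
]
entities_by_label = group_by_label(pairs)
print(entities_by_label)
```

    A plain dict of lists keeps the result easy to serialize; whether duplicates should be dropped depends on the use case.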

  • 2020-11-28 08:45

    You may also consider using spaCy:

    import spacy
    nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy 3+; install the model first
    
    doc = nlp('WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement.')
    
    print([ent for ent in doc.ents])
    
    >>> [WASHINGTON, New York, the 1990s, Loretta E. Lynch, Brooklyn, African-Americans]
    
  • 2020-11-28 08:46

    Since the return value is a tree, you presumably want to pick the subtrees that are labeled NE.

    Here is a simple example that gathers them all in a list:

    import nltk
    
    my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
    
    parse_tree = nltk.ne_chunk(nltk.tag.pos_tag(my_sent.split()), binary=True)  # POS tagging before chunking!
    
    named_entities = []
    
    for t in parse_tree.subtrees():
        if t.label() == 'NE':
            named_entities.append(t)
            # named_entities.append(list(t))  # if you want to save a list of tagged words instead of a tree
    
    print(named_entities)
    

    This gives:

    [Tree('NE', [('WASHINGTON', 'NNP')]), Tree('NE', [('New', 'NNP'), ('York', 'NNP')])]
    

    or as a list of lists:

    [[('WASHINGTON', 'NNP')], [('New', 'NNP'), ('York', 'NNP')]]
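
    To end up with plain strings rather than tagged words, join the tokens of each entry. A small sketch, assuming the list-of-lists form shown above (hard-coded here so it runs without NLTK; `ne_lists` and `entity_strings` are names chosen for illustration):

```python
# List-of-lists form from above: one inner list of (word, POS tag) pairs per named entity.
ne_lists = [[('WASHINGTON', 'NNP')], [('New', 'NNP'), ('York', 'NNP')]]

# Drop the POS tags and join each entity's words with single spaces.
entity_strings = [' '.join(word for word, _tag in entity) for entity in ne_lists]
print(entity_strings)  # ['WASHINGTON', 'New York']
```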
    

    Also see: How to navigate a nltk.tree.Tree?

  • 2020-11-28 08:50

    Use tree2conlltags from nltk.chunk. Note that ne_chunk needs POS-tagged input, and pos_tag in turn works on word tokens (hence word_tokenize).

    from nltk import word_tokenize, pos_tag, ne_chunk
    from nltk.chunk import tree2conlltags
    
    sentence = "Mark and John are working at Google."
    print(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence)))))
    """[('Mark', 'NNP', 'B-PERSON'), 
        ('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'), 
        ('are', 'VBP', 'O'), ('working', 'VBG', 'O'), 
        ('at', 'IN', 'O'), ('Google', 'NNP', 'B-ORGANIZATION'), 
        ('.', '.', 'O')] """
    

    This will give you a list of tuples: [(token, pos_tag, named_entity_tag)]. If this list is not exactly what you want, it is certainly easier to build the list you want from this one than from an nltk tree.
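
    As one example of parsing that list, the B-/I- tags can be merged back into whole entities. This is a minimal sketch, not part of NLTK: `iob_to_entities` is a hypothetical helper, and the input is the tuple list from the docstring above, hard-coded so the snippet runs without NLTK or its models.

```python
def iob_to_entities(conll_tags):
    """Merge (token, pos_tag, iob_tag) triples into (label, text) entity pairs."""
    entities = []
    current_label, current_words = None, []
    for token, _pos, iob in conll_tags:
        starts_new = iob.startswith("B-") or (
            iob.startswith("I-") and iob[2:] != current_label
        )
        if starts_new:
            # A new entity begins: store the previous one first, if any.
            if current_words:
                entities.append((current_label, " ".join(current_words)))
            current_label, current_words = iob[2:], [token]
        elif iob.startswith("I-"):
            # Continuation of the entity that is currently open.
            current_words.append(token)
        else:
            # "O" tag: outside any entity, so close the open one.
            if current_words:
                entities.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []
    if current_words:  # entity that runs to the end of the input
        entities.append((current_label, " ".join(current_words)))
    return entities


# Tuple list copied from the output shown above.
conll_tags = [
    ('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'),
    ('are', 'VBP', 'O'), ('working', 'VBG', 'O'), ('at', 'IN', 'O'),
    ('Google', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O'),
]
print(iob_to_entities(conll_tags))
```

    Multi-word names such as "New York" (tagged B-GPE, I-GPE) come out as a single entry, which a per-token filter cannot guarantee when two names are adjacent.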

    Code and details from this link; check it out for more information

    You can also continue by only extracting the words, with the following function:

    def wordextractor(conll_tags):
        # each item is a (word, pos_tag, ne_tag) triple;
        # keep the words whose NE tag is B-PERSON or I-PERSON
        return [word for word, _pos, ne_tag in conll_tags
                if ne_tag in ('B-PERSON', 'I-PERSON')]

    print(wordextractor(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence))))))
    

    Edit: Added output docstring. Edit: Added output only for B-PERSON.

  • 2020-11-28 09:00

    nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the tree to get to the NEs. A list comprehension can do that:

    import nltk   
    my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
    
    words = nltk.word_tokenize(my_sent)
    tagged = nltk.pos_tag(words)
    chunks = nltk.ne_chunk(tagged)
    # keep only the subtrees (the named entities) and join their words
    NE = [" ".join(w for w, t in ele) for ele in chunks if isinstance(ele, nltk.Tree)]
    print(NE)
    
  • 2020-11-28 09:04

    nltk.ne_chunk returns a nested nltk.tree.Tree object so you would have to traverse the Tree object to get to the NEs.

    Take a look at Named Entity Recognition with Regular Expression: NLTK

    >>> from nltk import ne_chunk, pos_tag, word_tokenize
    >>> from nltk.tree import Tree
    >>> 
    >>> def get_continuous_chunks(text):
    ...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
    ...     continuous_chunk = []
    ...     current_chunk = []
    ...     for i in chunked:
    ...             if isinstance(i, Tree):
    ...                     current_chunk.append(" ".join([token for token, pos in i.leaves()]))
    ...             if current_chunk:
    ...                     named_entity = " ".join(current_chunk)
    ...                     if named_entity not in continuous_chunk:
    ...                             continuous_chunk.append(named_entity)
    ...                     # reset even for duplicates so they do not merge into the next entity
    ...                     current_chunk = []
    ...             else:
    ...                     continue
    ...     return continuous_chunk
    ... 
    >>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
    >>> get_continuous_chunks(my_sent)
    ['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']
    
    
    >>> my_sent = "How's the weather in New York and Brooklyn"
    >>> get_continuous_chunks(my_sent)
    ['New York', 'Brooklyn']
    