Improving the extraction of human names with nltk

Anonymous (unverified), submitted 2019-12-03 02:44:02

Question:

I am trying to extract human names from text.

Does anyone have a method that they would recommend?

This is what I tried (code is below): I use nltk to find everything tagged as a person and then build a list of all the NNP parts of that person. I skip persons that have only one NNP, which avoids grabbing a lone surname.

I am getting decent results but was wondering if there are better ways to go about solving this problem.

Code:

    import nltk
    from nameparser.parser import HumanName

    def get_human_names(text):
        tokens = nltk.tokenize.word_tokenize(text)
        pos = nltk.pos_tag(tokens)
        sentt = nltk.ne_chunk(pos, binary = False)
        person_list = []
        person = []
        name = ""
        for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):
            for leaf in subtree.leaves():
                person.append(leaf[0])
            if len(person) > 1: #avoid grabbing lone surnames
                for part in person:
                    name += part + ' '
                if name[:-1] not in person_list:
                    person_list.append(name[:-1])
                name = ''
            person = []

        return (person_list)

    text = """
    Some economists have responded positively to Bitcoin, including
    Francois R. Velde, senior economist of the Federal Reserve in Chicago
    who described it as "an elegant solution to the problem of creating a
    digital currency." In November 2013 Richard Branson announced that
    Virgin Galactic would accept Bitcoin as payment, saying that he had invested
    in Bitcoin and found it "fascinating how a whole new global currency
    has been created", encouraging others to also invest in Bitcoin.
    Other economists commenting on Bitcoin have been critical.
    Economist Paul Krugman has suggested that the structure of the currency
    incentivizes hoarding and that its value derives from the expectation that
    others will accept it as payment. Economist Larry Summers has expressed
    a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
    strategist for ConvergEx Group, has remarked on the effect of increasing
    use of Bitcoin and its restricted supply, noting, "When incremental
    adoption meets relatively fixed supply, it should be no surprise that
    prices go up. And that’s exactly what is happening to BTC prices."
    """

    names = get_human_names(text)
    print "LAST, FIRST"
    for name in names:
        last_first = HumanName(name).last + ', ' + HumanName(name).first
        print last_first

Output:

    LAST, FIRST
    Velde, Francois
    Branson, Richard
    Galactic, Virgin
    Krugman, Paul
    Summers, Larry
    Colas, Nick

Apart from Virgin Galactic, this is all valid output. Of course, knowing that Virgin Galactic isn't a human name in the context of this article is the hard (maybe impossible) part.

Answer 1:

I agree with the suggestion that "make my code better" isn't well suited to this site, but I can point you to a direction worth digging into.

Take a look at the Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v 2.0, but you must download some core files. Here is a script which can do all of that for you.

I wrote this script:

    import nltk
    from nltk.tag.stanford import NERTagger

    st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
    text = """YOUR TEXT GOES HERE"""

    for sent in nltk.sent_tokenize(text):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)
        for tag in tags:
            if tag[1]=='PERSON': print tag

and the output was not bad:

    ('Francois', 'PERSON')
    ('R.', 'PERSON')
    ('Velde', 'PERSON')
    ('Richard', 'PERSON')
    ('Branson', 'PERSON')
    ('Virgin', 'PERSON')
    ('Galactic', 'PERSON')
    ('Bitcoin', 'PERSON')
    ('Bitcoin', 'PERSON')
    ('Paul', 'PERSON')
    ('Krugman', 'PERSON')
    ('Larry', 'PERSON')
    ('Summers', 'PERSON')
    ('Bitcoin', 'PERSON')
    ('Nick', 'PERSON')
    ('Colas', 'PERSON')
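The tagger returns one tag per token, so a full name still has to be stitched together from adjacent PERSON tokens. Here is a minimal sketch of that step, assuming the input is the complete list of (token, tag) pairs returned by st.tag(tokens) for a sentence, including the 'O' tags for non-entities. The function name group_person_names and the sample list are illustrative and not part of NLTK.

```python
from itertools import groupby

def group_person_names(tagged_tokens, min_parts=2):
    """Join runs of adjacent PERSON tokens into full names.

    tagged_tokens: list of (token, tag) pairs, e.g. the output of st.tag(tokens).
    min_parts: drop runs shorter than this (2 skips lone surnames).
    """
    names = []
    # groupby yields one group per run of consecutive pairs sharing the same tag.
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "PERSON":
            continue
        parts = [token for token, _ in run]
        if len(parts) >= min_parts:
            names.append(" ".join(parts))
    return names

# Hand-written sample in the tagger's output format (not real tagger output).
sample = [
    ("Economist", "O"), ("Paul", "PERSON"), ("Krugman", "PERSON"),
    ("and", "O"), ("Summers", "PERSON"), ("commented", "O"), (".", "O"),
]
print(group_person_names(sample))
```

Setting min_parts=1 keeps single-token names as well, at the cost of letting through lone mis-tagged tokens such as "Bitcoin".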

Hope this is helpful.



Answer 2:

You can try to resolve the names you find by checking whether they appear in a database such as freebase.com. Get the data locally and query it (it's in RDF), or use Google's API: https://developers.google.com/freebase/v1/getting-started. Most big companies, geographical locations, etc. (which your snippet would otherwise catch) could then be discarded based on the Freebase data.
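The filtering step could look roughly like this. The sketch assumes the entity data has already been loaded locally into a plain dictionary that maps a name to its type. KNOWN_ENTITIES and drop_known_non_persons are hypothetical names, and the entries are hand-written placeholders rather than real Freebase data.

```python
# Hypothetical local lookup: lowercased entity name -> type. In practice this
# would be built from a downloaded knowledge-base dump, not written by hand.
KNOWN_ENTITIES = {
    "virgin galactic": "organization",
    "federal reserve": "organization",
    "chicago": "location",
    "richard branson": "person",
}

def drop_known_non_persons(candidates, known=KNOWN_ENTITIES):
    """Keep candidates unless the lookup says they are something other than a person."""
    kept = []
    for name in candidates:
        entity_type = known.get(name.lower())
        # Names missing from the lookup are kept: absence is not evidence.
        if entity_type is None or entity_type == "person":
            kept.append(name)
    return kept

candidates = ["Richard Branson", "Virgin Galactic", "Nick Colas"]
print(drop_known_non_persons(candidates))
```

Keeping unknown names is a deliberate choice here: a knowledge base is more likely to list large companies and places than every person mentioned in an article.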



Answer 3:

For anyone else looking, I found this article to be useful: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

    >>> import nltk
    >>> def extract_entities(text):
    ...     for sent in nltk.sent_tokenize(text):
    ...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
    ...             if hasattr(chunk, 'node'):
    ...                 print chunk.node, ' '.join(c[0] for c in chunk.leaves())
    ...


Answer 4:

spaCy can be a good alternative for retrieving names from text.

https://spacy.io/usage/training#ner



Answer 5:

This worked pretty well for me. I just had to change one line in order for it to run.

    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'): 

needs to be

    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'): 

There were imperfections in the output (for example it identified "Money Laundering" as a person), but with my data a name database may not be dependable.
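If the same script has to run against both older and newer NLTK releases, the label lookup can be wrapped in a small helper. This is a sketch, not an NLTK function: tree_label is a hypothetical name, and the two stub classes only imitate the two Tree interfaces so the example runs without NLTK installed.

```python
def tree_label(tree):
    """Return a chunk's label via label() if it exists, else the older node attribute."""
    label = getattr(tree, "label", None)
    if callable(label):
        return label()
    return tree.node

# Stand-ins for the two Tree interfaces (illustration only, not NLTK classes).
class NewStyleTree:
    def label(self):
        return "PERSON"

class OldStyleTree:
    node = "ORGANIZATION"

print(tree_label(NewStyleTree()), tree_label(OldStyleTree()))
```

With the helper in place, the filter from the question would read filter=lambda t: tree_label(t) == 'PERSON'.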



Answer 6:

The answer by @trojane didn't quite work for me, but it helped a lot in arriving at this one.

Prerequisites

Create a folder stanford-ner and download to it the two files that the script loads: english.all.3class.distsim.crf.ser.gz and stanford-ner.jar.

Script

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import nltk
    from nltk.tag.stanford import StanfordNERTagger

    text = u"""
    Some economists have responded positively to Bitcoin, including
    Francois R. Velde, senior economist of the Federal Reserve in Chicago
    who described it as "an elegant solution to the problem of creating a
    digital currency." In November 2013 Richard Branson announced that
    Virgin Galactic would accept Bitcoin as payment, saying that he had invested
    in Bitcoin and found it "fascinating how a whole new global currency
    has been created", encouraging others to also invest in Bitcoin.
    Other economists commenting on Bitcoin have been critical.
    Economist Paul Krugman has suggested that the structure of the currency
    incentivizes hoarding and that its value derives from the expectation that
    others will accept it as payment. Economist Larry Summers has expressed
    a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
    strategist for ConvergEx Group, has remarked on the effect of increasing
    use of Bitcoin and its restricted supply, noting, "When incremental
    adoption meets relatively fixed supply, it should be no surprise that
    prices go up. And that’s exactly what is happening to BTC prices.
    """

    st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                           'stanford-ner/stanford-ner.jar')

    for sent in nltk.sent_tokenize(text):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)
        for tag in tags:
            if tag[1] in ["PERSON", "LOCATION", "ORGANIZATION"]:
                print(tag)

Results

    (u'Bitcoin', u'LOCATION')       # wrong
    (u'Francois', u'PERSON')
    (u'R.', u'PERSON')
    (u'Velde', u'PERSON')
    (u'Federal', u'ORGANIZATION')
    (u'Reserve', u'ORGANIZATION')
    (u'Chicago', u'LOCATION')
    (u'Richard', u'PERSON')
    (u'Branson', u'PERSON')
    (u'Virgin', u'PERSON')          # Wrong
    (u'Galactic', u'PERSON')        # Wrong
    (u'Bitcoin', u'PERSON')         # Wrong
    (u'Bitcoin', u'LOCATION')       # Wrong
    (u'Bitcoin', u'LOCATION')       # Wrong
    (u'Paul', u'PERSON')
    (u'Krugman', u'PERSON')
    (u'Larry', u'PERSON')
    (u'Summers', u'PERSON')
    (u'Bitcoin', u'PERSON')         # Wrong
    (u'Nick', u'PERSON')
    (u'Colas', u'PERSON')
    (u'ConvergEx', u'ORGANIZATION')
    (u'Group', u'ORGANIZATION')
    (u'Bitcoin', u'LOCATION')       # Wrong
    (u'BTC', u'ORGANIZATION')       # Wrong


Answer 7:

I only wanted to extract person names, so I decided to check every name in the output against WordNet (a large lexical database of English). More information on WordNet can be found here: http://www.nltk.org/howto/wordnet.html

    import nltk
    from nameparser.parser import HumanName
    from nltk.corpus import wordnet

    person_list = []
    # person_names is an alias of person_list (same list object), so
    # person_list must be defined first.
    person_names = person_list

    def get_human_names(text):
        tokens = nltk.tokenize.word_tokenize(text)
        pos = nltk.pos_tag(tokens)
        sentt = nltk.ne_chunk(pos, binary = False)

        person = []
        name = ""
        for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
            for leaf in subtree.leaves():
                person.append(leaf[0])
            if len(person) > 1: #avoid grabbing lone surnames
                for part in person:
                    name += part + ' '
                if name[:-1] not in person_list:
                    person_list.append(name[:-1])
                name = ''
            person = []
    #     print (person_list)

    text = """
    Some economists have responded positively to Bitcoin, including
    Francois R. Velde, senior economist of the Federal Reserve in Chicago
    who described it as "an elegant solution to the problem of creating a
    digital currency." In November 2013 Richard Branson announced that
    Virgin Galactic would accept Bitcoin as payment, saying that he had invested
    in Bitcoin and found it "fascinating how a whole new global currency
    has been created", encouraging others to also invest in Bitcoin.
    Other economists commenting on Bitcoin have been critical.
    Economist Paul Krugman has suggested that the structure of the currency
    incentivizes hoarding and that its value derives from the expectation that
    others will accept it as payment. Economist Larry Summers has expressed
    a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
    strategist for ConvergEx Group, has remarked on the effect of increasing
    use of Bitcoin and its restricted supply, noting, "When incremental
    adoption meets relatively fixed supply, it should be no surprise that
    prices go up. And that’s exactly what is happening to BTC prices."
    """

    # Fills person_list as a side effect; the function itself returns None.
    names = get_human_names(text)
    for person in person_list:
        person_split = person.split(" ")
        for name in person_split:
            # Drop the candidate if any of its parts is a WordNet entry.
            if wordnet.synsets(name):
                if(name in person):
                    person_names.remove(person)
                    break

    print(person_names)

OUTPUT

['Francois R. Velde', 'Richard Branson', 'Economist Paul Krugman', 'Nick Colas'] 

Apart from Larry Summers, all the names are correct. He is dropped because his last name, "Summers", is also an ordinary English word.
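One thing to watch in the script above: person_names is assigned from person_list, so if both names point at the same list object, calling remove() inside the loop mutates the list being iterated, and Python can skip the entry that follows each removal. Below is a minimal sketch of the same filter that builds a new list instead. is_dictionary_word is a hypothetical stand-in for the wordnet.synsets(name) check, backed by a tiny hand-written word set so the example runs without NLTK data.

```python
# Hypothetical stand-in for wordnet.synsets(word): a tiny hand-written
# vocabulary, so the sketch runs without NLTK or its corpora.
DICTIONARY_WORDS = {"virgin", "galactic", "economist", "summers"}

def is_dictionary_word(word):
    return word.lower() in DICTIONARY_WORDS

def filter_names(candidates):
    """Return a new list without candidates that contain a dictionary word."""
    return [
        name for name in candidates
        if not any(is_dictionary_word(part) for part in name.split())
    ]

candidates = [
    "Francois R. Velde", "Virgin Galactic",
    "Economist Paul Krugman", "Larry Summers",
]
print(filter_names(candidates))
```

Because every candidate is checked, this version is stricter than the loop above and may return fewer names than the output shown. Stripping a leading title such as "Economist" before the check would be one way to keep such entries.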


