Question
I'm trying to extract countries from organisation addresses using spaCy NER. However, it labels countries and cities with the same tag, GPE. Is there any way I can differentiate them?
For instance:
import en_core_web_sm

nlp = en_core_web_sm.load()
doc= nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')
for ent in doc.ents:
    if ent.label_ == 'GPE':
        print(ent.text)
This gives back:
Tempe
AZ
United States
United States
Tempe
AZ
United States
Tempe
AZ
United States
Answer 1:
As other answers have mentioned, the GPE label in the pre-trained spaCy model covers countries, cities and states alike. There are workarounds, though, and several approaches can be used.
One approach: you could add a custom tag to the model. There is a good article on Towards Data Science that could help you do that. Gathering training data for this could be a hassle, as you would need to tag cities and countries at their respective positions in each sentence. I quote the answer from Stack Overflow:
Spacy NER model training includes the extraction of other "implicit" features, such as POS and surrounding words.
When you attempt to train on single words, it is unable to get generalized enough features to detect those entities.
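To illustrate what "tagging per location in the sentence" means, here is a minimal sketch of training data that annotates entities inside full addresses rather than as single words. The labels CITY and COUNTRY and the helper annotate are hypothetical names chosen for this example; the (text, {"entities": [...]}) layout with character offsets is the shape spaCy NER training examples have commonly used, but check the documentation of your spaCy version before relying on it.

```python
def annotate(text, spans):
    """Build (text, {"entities": [(start, end, label), ...]}) from
    (substring, label) pairs, searching left to right in the text."""
    entities = []
    cursor = 0
    for substring, label in spans:
        start = text.index(substring, cursor)  # raises ValueError if missing
        end = start + len(substring)
        entities.append((start, end, label))
        cursor = end
    return text, {"entities": entities}


# CITY and COUNTRY are hypothetical custom labels, not part of the
# pre-trained model, which only knows GPE.
TRAIN_DATA = [
    annotate(
        "Arizona State University, Tempe, AZ, United States",
        [("Tempe", "CITY"), ("United States", "COUNTRY")],
    ),
    annotate(
        "Naval Postgraduate School, Monterey, CA, United States",
        [("Monterey", "CITY"), ("United States", "COUNTRY")],
    ),
]

for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(label, text[start:end])
```

Each entity keeps its surrounding context, which is what the quoted answer says the model needs in order to generalize.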
An easier workaround to this could be the following:
Install geonamescache
pip install geonamescache
Then use the following code to get the nested dictionaries of countries and cities:
import geonamescache
gc = geonamescache.GeonamesCache()
# gets nested dictionary for countries
countries = gc.get_countries()
# gets nested dictionary for cities
cities = gc.get_cities()
The documentation states that you can get a host of other location options as well.
Use the following function to get all the values of a key with a certain name from a nested dictionary (obtained from this answer):
def gen_dict_extract(var, key):
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield v
            if isinstance(v, (dict, list)):
                yield from gen_dict_extract(v, key)
    elif isinstance(var, list):
        for d in var:
            yield from gen_dict_extract(d, key)
Load up two lists, of cities and countries respectively:
cities = [*gen_dict_extract(cities, 'name')]
countries = [*gen_dict_extract(countries, 'name')]
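Since these names are only used for membership checks, sets may be a better fit than lists when the city data is large. The sketch below uses small stand-in dictionaries; their shape (an id mapped to a record with a 'name' key) is an assumption based on the nested dictionaries described above, and names_of is a hypothetical helper, not part of geonamescache.

```python
# Stand-in data, assumed to mirror the nested dictionaries used above.
sample_countries = {
    "US": {"name": "United States", "iso": "US"},
    "FR": {"name": "France", "iso": "FR"},
}
sample_cities = {
    "1": {"name": "Tempe", "countrycode": "US"},
    "2": {"name": "Monterey", "countrycode": "US"},
}


def names_of(records):
    """Collect the 'name' field of each top-level record into a set."""
    return {record["name"] for record in records.values() if "name" in record}


country_names = names_of(sample_countries)
city_names = names_of(sample_cities)

print("United States" in country_names)  # True
print("Tempe" in city_names)             # True
print("Tempe" in country_names)          # False
```

A set also removes duplicate names, which matters for cities because the same name can occur in several countries.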
Then use the following code to differentiate:
import spacy

nlp = spacy.load("en_core_web_sm")
doc= nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')
for ent in doc.ents:
    if ent.label_ == 'GPE':
        if ent.text in countries:
            print(f"Country : {ent.text}")
        elif ent.text in cities:
            print(f"City : {ent.text}")
        else:
            print(f"Other GPE : {ent.text}")
Output:
City : Tempe
Other GPE : AZ
Country : United States
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
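In this output, state abbreviations such as AZ fall into the "Other GPE" bucket. One possible refinement is a third lookup for states. The sketch below is self-contained and uses small hand-written sets; classify_gpe is a hypothetical helper, and the state names would in practice have to come from some data source (the geonamescache documentation mentions further location options, so check there first).

```python
def classify_gpe(text, countries, states, cities):
    """Return a coarse label for a GPE string using plain set lookups.

    Countries are checked first, so a name that is both a country and a
    city (or state) is reported as a country.
    """
    name = text.strip()
    if name in countries:
        return "Country"
    if name in states:
        return "State"
    if name in cities:
        return "City"
    return "Other GPE"


# Hand-written example data, for illustration only.
countries = {"United States"}
states = {"AZ", "Arizona", "CA", "California"}
cities = {"Tempe", "Monterey"}

for text in ["Tempe", "AZ", "United States", "Atlantis"]:
    print(f"{classify_gpe(text, countries, states, cities)} : {text}")
```

The lookup order is a design choice: ambiguous names resolve to the first matching category, so reorder the checks if cities should win over countries in your data.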
Answer 2:
As stated before, the GPE entity covers countries, cities and states, so you won't be able to detect only country entities with the given model.
I would suggest simply creating a list of countries and then checking whether the GPE entity is in this list or not.
import en_core_web_sm

nlp = en_core_web_sm.load()
doc= nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')
# create a list of country names that possibly appear in the text
countries = ['US', 'USA', 'United States']
for ent in doc.ents:
    if ent.label_ == 'GPE':
        # check if the value is in the list of countries
        if ent.text in countries:
            print(ent.text, '-- Country')
        else:
            print(ent.text, '-- City or State')
This will output the following:
Tempe -- City or State
United States -- Country
Monterey -- City or State
United States -- Country
Tempe -- City or State
United States -- Country
United States -- Country
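If the same country can appear under several spellings ('US', 'USA', 'United States'), a dictionary that maps each variant to one canonical name may be more convenient than a flat list. This is a minimal sketch, not part of the original answer; COUNTRY_ALIASES and canonical_country are hypothetical names, and the input list stands in for the GPE texts returned by the model.

```python
# Map every spelling we expect to see to one canonical country name.
COUNTRY_ALIASES = {
    "US": "United States",
    "USA": "United States",
    "United States": "United States",
}


def canonical_country(text):
    """Return the canonical country name, or None if text is not a known country."""
    return COUNTRY_ALIASES.get(text.strip())


# Stand-in for the GPE entity texts extracted by the model.
gpe_texts = ["Tempe", "United States", "Monterey", "USA"]

found = []
for text in gpe_texts:
    country = canonical_country(text)
    if country is not None:
        found.append(country)

print(found)  # ['United States', 'United States']
```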
Source: https://stackoverflow.com/questions/59444065/differentiate-between-countries-and-cities-in-spacy-ner