Differentiate between countries and cities in spaCy NER

Submitted by 被刻印的时光 ゝ on 2021-02-04 16:22:37

Question


I'm trying to extract countries from organisation addresses using spaCy NER. However, it labels both countries and cities with the same tag, GPE. Is there any way to differentiate them?

For instance:

import en_core_web_sm

nlp = en_core_web_sm.load()

doc= nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')

for ent in doc.ents:
    if ent.label_ == 'GPE':
        print(ent.text)

This gives back:

Tempe
AZ
United States
United States
Tempe
AZ
United States
Tempe
AZ
United States

Answer 1:


As other answers have mentioned, the GPE label in the pre-trained spaCy model covers countries, cities and states. However, there are workarounds, and several approaches can be used.

One approach: you could add a custom tag to the model. There is a good article on Towards Data Science that could help you do that. Gathering training data for this could be a hassle, as you would need to tag cities and countries by their position in each sentence. To quote an answer from Stack Overflow:

Spacy NER model training includes the extraction of other "implicit" features, such as POS and surrounding words.

When you attempt to train on single words, it is unable to get generalized enough features to detect those entities.
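
To illustrate what tagging entities by their position in the sentence involves, here is a minimal sketch that builds an annotation in the `(text, {"entities": [(start, end, label)]})` shape that spaCy training examples are commonly built from. The labels `CITY` and `COUNTRY` and the helper `annotate` are hypothetical names chosen for illustration; they are not part of the pre-trained model.

```python
def annotate(text, labelled_spans):
    """Build one training example with character offsets.

    labelled_spans is a list of (substring, label) pairs; every occurrence
    of each substring in text is annotated with that label.
    """
    entities = []
    for substring, label in labelled_spans:
        start = text.find(substring)
        while start != -1:
            end = start + len(substring)
            entities.append((start, end, label))
            # Continue searching after the match that was just recorded.
            start = text.find(substring, end)
    entities.sort()
    return (text, {"entities": entities})


example = annotate(
    "Arizona State University, Tempe, AZ, United States",
    [("Tempe", "CITY"), ("United States", "COUNTRY")],
)
print(example)
```

Each entity is annotated inside a full address rather than as an isolated word, which is the point the quoted answer makes about surrounding context.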

An easier workaround is the following.

Install geonamescache:

pip install geonamescache

Then use the following code to get nested dictionaries of countries and cities:

import geonamescache

gc = geonamescache.GeonamesCache()

# gets nested dictionary for countries
countries = gc.get_countries()

# gets nested dictionary for cities
cities = gc.get_cities()

The documentation states that you can get a host of other location options as well.

Use the following generator function to get all the values stored under a given key anywhere in a nested dictionary (taken from another answer):

def gen_dict_extract(var, key):
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield v
            if isinstance(v, (dict, list)):
                yield from gen_dict_extract(v, key)
    elif isinstance(var, list):
        for d in var:
            yield from gen_dict_extract(d, key)

Build two lists, one of city names and one of country names:

cities = [*gen_dict_extract(cities, 'name')]
countries = [*gen_dict_extract(countries, 'name')]
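
Since `in` on a list scans it element by element, and the list of city names is long, it may be worth storing the names in sets instead. A minimal sketch, using small hand-written dictionaries (`sample_countries`, `sample_cities`) that only imitate the nested shape described above; they are assumptions for illustration, not real geonamescache output:

```python
# Hypothetical stand-ins for gc.get_countries() and gc.get_cities():
# top-level keys map to dictionaries that each contain a 'name' entry.
sample_countries = {
    "US": {"name": "United States", "iso": "US"},
    "FR": {"name": "France", "iso": "FR"},
}
sample_cities = {
    "1": {"name": "Tempe", "countrycode": "US"},
    "2": {"name": "Monterey", "countrycode": "US"},
    "3": {"name": "Paris", "countrycode": "FR"},
}

# Sets give constant-time membership tests on average and drop duplicate names.
country_names = {record["name"] for record in sample_countries.values()}
city_names = {record["name"] for record in sample_cities.values()}

print("United States" in country_names)  # True
print("Tempe" in city_names)             # True
print("AZ" in city_names)                # False
```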

Then use the following code to differentiate:

import spacy

nlp = spacy.load("en_core_web_sm")

doc= nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')

for ent in doc.ents:
    if ent.label_ == 'GPE':
        if ent.text in countries:
            print(f"Country : {ent.text}")
        elif ent.text in cities:
            print(f"City : {ent.text}")
        else:
            print(f"Other GPE : {ent.text}")

Output:

City : Tempe
Other GPE : AZ
Country : United States
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
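
In the output above, `AZ` falls through to `Other GPE` because it is neither a country name nor a city name. One possible extension is a third lookup for state abbreviations. The sketch below assumes a hypothetical, deliberately tiny `us_state_codes` set and sample name sets; in practice these would be filled from a fuller source. `classify_gpe` is an illustrative helper, not part of spaCy or geonamescache.

```python
def classify_gpe(text, countries, cities, states):
    """Return a coarse label for the text of a GPE entity."""
    name = text.strip()
    if name in countries:
        return "Country"
    # Check states before cities so that short codes such as "AZ"
    # are not mistaken for a city that happens to share the name.
    if name in states:
        return "State"
    if name in cities:
        return "City"
    return "Other GPE"


# Hypothetical sample data for illustration only.
countries = {"United States"}
cities = {"Tempe", "Monterey"}
us_state_codes = {"AZ", "CA"}

for gpe in ["Tempe", "AZ", "United States", "Springfield"]:
    print(f"{classify_gpe(gpe, countries, cities, us_state_codes)} : {gpe}")
```

Inside the entity loop, the call would take `ent.text` as its first argument.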



Answer 2:


As stated before, the GPE label covers countries, cities and states, so the given model cannot detect countries alone.

I suggest simply creating a list of countries and then checking whether each GPE entity is in that list:

import en_core_web_sm

nlp = en_core_web_sm.load()

doc= nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')

# create a list of country names that possibly appear in the text
countries = ['US', 'USA', 'United States']

for ent in doc.ents:
    if ent.label_ == 'GPE':
        # check if the value is in the list of countries
        if ent.text in countries:
            print(ent.text, '-- Country')
        else:
            print(ent.text, '-- City or State')

This will output the following:

Tempe -- City or State
United States -- Country
Monterey -- City or State
United States -- Country
Tempe -- City or State
United States -- Country
United States -- Country
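
A hand-written list only matches exact spellings, so `U.S.A.` or `united states` would be labelled as a city or state. A minimal sketch of one way to loosen the match, by normalising case and punctuation before the lookup; the helper names `normalise` and `is_country` are hypothetical:

```python
import string


def normalise(name):
    """Lower-case the name, drop punctuation, and collapse whitespace."""
    cleaned = name.translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.casefold().split())


def is_country(name, known_countries):
    """Check a name against the normalised country list."""
    return normalise(name) in {normalise(c) for c in known_countries}


countries = ['US', 'USA', 'United States']

print(is_country("U.S.A.", countries))          # True
print(is_country("united  states", countries))  # True
print(is_country("Tempe", countries))           # False
```

This is only meant for text that spaCy has already labelled as GPE; applied to arbitrary words, the case-insensitive match would also accept the pronoun "us".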



Source: https://stackoverflow.com/questions/59444065/differentiate-between-countries-and-cities-in-spacy-ner
