SPARQL query to retrieve countries population from DBpedia

Submitted by 早过忘川 on 2020-01-04 09:17:05

Question


I have developed the following SPARQL query to get a list of countries and their populations from DBpedia. I use the UNION clauses to identify which resources are current countries, because the information is inconsistent from one country to another: for example, there are different standards for country codes, and some countries don't have a code at all.

The problem is that some countries have a dbpprop:populationEstimate property while others have dbpprop:populationCensus, and I don't know how to bind either of them to ?population. As it stands, I only get the estimated population. I guess that is because having two OPTIONAL clauses that both match ?population doesn't make sense, but I can't get any closer to a solution.

For example, India has dbpprop:populationCensus, but it doesn't appear in the results.

PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX yago:<http://dbpedia.org/class/yago/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX category: <http://dbpedia.org/resource/Category:>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?name ?population
WHERE {
    ?country a dbo:Country .
    ?country rdfs:label ?enName .   

    OPTIONAL {?country dbpprop:populationEstimate ?population}
    OPTIONAL {?country dbpprop:populationCensus ?population}
    OPTIONAL {?country dbpprop:yearEnd ?yearEnd}

    { ?country dbpprop:iso3166code ?code . }
    UNION
    { ?country dbpprop:iso31661Alpha ?code . }
    UNION
    { ?country dbpprop:countryCode ?code . }
    UNION
    { ?country a yago:MemberStatesOfTheUnitedNations . }

    FILTER (langMatches(lang(?enName), "en")) 
    FILTER (!bound(?yearEnd))
    FILTER (xsd:integer(?population))
    BIND (str(?enName) AS ?name)
}

Thanks everyone for your help :)


Answer 1:


First, I'm going to use the prefixes defined in the DBpedia SPARQL endpoint so that we can copy and paste queries. I think the only difference is that dbo will now be dbpedia-owl. Second, you're using a number of raw data properties, but if you can, you ought to try to use properties from the ontology, as explained in this answer. That doesn't necessarily affect the results you're getting here, but you'll generally get cleaner data if you use the ontology properties.

Modifying your query

FILTER NOT EXISTS for removing countries that have ended

Let's clean up the query a little bit first, and then turn to the question of getting the various population properties. Removing countries that have an end date can be done a bit more simply. Instead of

OPTIONAL {?country dbpprop:yearEnd ?yearEnd}
FILTER (!bound(?yearEnd))

you can use FILTER NOT EXISTS to make this a bit more direct:

FILTER NOT EXISTS { ?country dbpprop:yearEnd ?yearEnd }

To prefer properties from the DBpedia ontology over raw infobox data properties, you might want to consider using dbpedia-owl:dissolutionYear rather than dbpprop:yearEnd, giving:

FILTER NOT EXISTS { ?country dbpedia-owl:dissolutionYear ?yearEnd }

Simplify filtering for languages

It's reasonable to expect rdfs:label values to be literals, and the lang function requires its argument to be a literal, so you don't really need to bind str(?enName) to ?name; it's sufficient just to bind ?name in the triple pattern, and then check its language (which you're doing correctly using langMatches). That is, instead of

?country rdfs:label ?enName .   
FILTER (langMatches(lang(?enName), "en")) 
BIND (str(?enName) AS ?name)

you can just use

?country rdfs:label ?name .   
FILTER (langMatches(lang(?name), "en"))

This does mean that the name you get back will have a language tag. If you really just want the plain string, you can either BIND as you did before, or use an AS expression in the SELECT clause, e.g.,

SELECT DISTINCT (str(?name) as ?noLangName) ?population

Checking that population is bound and is a number

I don't think filtering on xsd:integer(?population) will do much for you either. That notation isn't a type predicate but a casting function, so ?population is cast to an integer and the filter then tests the result. A successful cast to a non-zero integer passes. A cast that yields 0 fails, because 0 has an effective boolean value of false, and a cast that raises an error also removes the row. You'd still want to know if a country has a population of 0 though, right? However, you do only want countries with populations, so you could filter with bound:

FILTER(bound(?population))

However, since the properties here are raw infobox properties, there is some noise in the data, so we end up with values like

"Denmark"@en "- Density 57,695"@en
"Denmark"@en "- Faroe Islands"@en

which aren't useful. A better filter would just check that the value is a number (which will implicitly require that it's bound), and there is a function isNumeric for just that purpose, so we use:

FILTER (isNumeric(?population))
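
To see what isNumeric accepts, here is a small sketch that needs no data from the endpoint. The four sample values are made up for illustration and are not taken from DBpedia.

```sparql
# Self-contained check: the values come from VALUES, not from a dataset.
SELECT ?value (isNumeric(?value) AS ?numeric)
WHERE {
    VALUES ?value {
        0                       # numeric literal (xsd:integer)
        57695                   # numeric literal (xsd:integer)
        "57695"                 # plain string that merely looks like a number
        "- Faroe Islands"@en    # language-tagged string, like the noisy infobox values
    }
}
```

Under SPARQL 1.1, ?numeric should be true for the two integer literals, including 0, and false for both strings. Note that this means numeric-looking values stored as plain strings are filtered out as well.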

Simplifying similar UNION patterns with VALUES

You can clean up the UNION pattern by using VALUES. Instead of UNIONing several almost identical patterns, you can define a variable ?hasCode that will only have the values dbpprop:iso3166code, etc. I.e., instead of:

{ ?country dbpprop:iso3166code ?code . }
UNION
{ ?country dbpprop:iso31661Alpha ?code . }
UNION
{ ?country dbpprop:countryCode ?code . }
UNION
{ ?country a yago:MemberStatesOfTheUnitedNations . }

you can use:

values ?hasCode { dbpprop:iso3166code dbpprop:iso31661Alpha dbpprop:countryCode }
{ ?country ?hasCode ?code . }
UNION
{ ?country a yago:MemberStatesOfTheUnitedNations . }

You can do a similar thing with the ?population retrieval:

OPTIONAL {?country dbpprop:populationEstimate ?population}
OPTIONAL {?country dbpprop:populationCensus ?population}

can become:

values ?hasPopulation { dbpprop:populationEstimate dbpprop:populationCensus }
OPTIONAL { ?country ?hasPopulation ?population }
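
Another way to write the same idea is a SPARQL 1.1 property path with alternation (|). This is only a sketch: it assumes the endpoint supports property paths, and it selects ?country rather than the label to stay short. Unlike the VALUES version, it does not tell you which of the two properties matched.

```sparql
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?country ?population
WHERE {
    ?country a dbpedia-owl:Country .
    # Alternation path: matches a triple with either property.
    ?country dbpprop:populationEstimate|dbpprop:populationCensus ?population .
    # Requires a bound, numeric value, so no OPTIONAL is needed here.
    FILTER (isNumeric(?population))
}
```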

The final result

The rewritten query is now:

SELECT DISTINCT ?name ?population
WHERE {
    ?country a dbpedia-owl:Country .
    ?country rdfs:label ?name .   
    FILTER (langMatches(lang(?name), "en")) 

    values ?hasPopulation { dbpprop:populationEstimate dbpprop:populationCensus }
    OPTIONAL { ?country ?hasPopulation ?population }
    FILTER (isNumeric(?population))

    FILTER NOT EXISTS { ?country dbpedia-owl:dissolutionYear ?yearEnd }

    values ?hasCode { dbpprop:iso3166code dbpprop:iso31661Alpha dbpprop:countryCode }
    { ?country ?hasCode ?code . }
    UNION
    { ?country a yago:MemberStatesOfTheUnitedNations . }
}
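
The query above relies on the prefixes predefined by the DBpedia SPARQL endpoint. To run it somewhere that does not predefine them, declarations along these lines would presumably be needed. The namespace IRIs are copied from the PREFIX lines in the question, on the assumption that dbpedia-owl refers to the same namespace as dbo.

```sparql
# Assumed prefix declarations; place these before the SELECT.
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX yago: <http://dbpedia.org/class/yago/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
```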

SPARQL results

India now appears in the results with a population:

"India"@en 1210193422



Answer 2:


How to work around the problem

I think I have an idea of how you could work around this problem.

For the optional clauses, use separate variables

OPTIONAL {?country dbpprop:populationEstimate ?populationEstimate}
OPTIONAL {?country dbpprop:populationCensus ?populationCensus}
OPTIONAL {?country dbpprop:yearEnd ?yearEnd}
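
A small sketch of why separate variables help. OPTIONAL clauses are applied in order, so once the first one binds ?population, the second can only match if it yields the same value. The names "A" and "B" and the numbers are invented for illustration, and the inline VALUES tables stand in for the two population properties.

```sparql
SELECT ?c ?population
WHERE {
    VALUES ?c { "A" "B" }
    # Stand-in for populationEstimate: only "A" has one.
    OPTIONAL { VALUES (?c ?population) { ("A" 1) } }
    # Stand-in for populationCensus: both have one.
    OPTIONAL { VALUES (?c ?population) { ("A" 2) ("B" 3) } }
}
```

Under SPARQL 1.1 semantics this should return 1 for "A" and 3 for "B": the census value 2 for "A" is never reported because ?population is already bound to 1.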

Then, bind one of them to ?population

BIND(IF(bound(?populationEstimate), ?populationEstimate, ?populationCensus) as ?population)
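
The same choice can also be written with the SPARQL 1.1 COALESCE function, which returns its first argument that is bound and does not raise an error. A self-contained sketch with made-up rows, where UNDEF marks a missing estimate:

```sparql
SELECT ?c ?population
WHERE {
    VALUES (?c ?populationEstimate ?populationCensus) {
        ("A" 10 20)        # both present: the estimate wins
        ("B" UNDEF 30)     # no estimate: falls back to the census value
    }
    BIND (COALESCE(?populationEstimate, ?populationCensus) AS ?population)
}
```

This should give 10 for "A" and 30 for "B", matching what the IF(bound(...)) form produces.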

Finally, check the bound variable in your filter expression

FILTER (xsd:integer(?population))

The rest of the query remains the same. I've tested this against the DBpedia SPARQL endpoint and at first glance, it seems to yield the right results.

Let me know if this is correct.

The full query

PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX yago:<http://dbpedia.org/class/yago/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX category: <http://dbpedia.org/resource/Category:>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?name ?population 
WHERE {
    ?country a dbo:Country .
    ?country rdfs:label ?enName .   

    OPTIONAL {?country dbpprop:populationEstimate ?populationEstimate}
    OPTIONAL {?country dbpprop:populationCensus ?populationCensus}
    OPTIONAL {?country dbpprop:yearEnd ?yearEnd}


    BIND(IF(bound(?populationEstimate), ?populationEstimate, ?populationCensus) as ?population)


    FILTER (langMatches(lang(?enName), "en")) 
    FILTER (!bound(?yearEnd))
    FILTER (xsd:integer(?population))

    { ?country dbpprop:iso3166code ?code . }
    UNION
    { ?country dbpprop:iso31661Alpha ?code . }
    UNION
    { ?country dbpprop:countryCode ?code . }
    UNION
    { ?country a yago:MemberStatesOfTheUnitedNations . }

    BIND (str(?enName) AS ?name)
}


Source: https://stackoverflow.com/questions/19145979/sparql-query-to-retrieve-countries-population-from-dbpedia
