DBpedia SPARQL query returns multiple and duplicate records

孤者浪人 submitted on 2019-12-19 11:22:18

Question


I am quite new to SPARQL and am confused by the various syntax standards that exist for it. I am struggling to fetch unique data from DBpedia using the following query:

SELECT DISTINCT ?Museum, ?name, ?abstract, ?thumbnail, ?latitude,
   ?longitude, ?photoCollection, ?website, ?homepage, ?wikilink
WHERE { 
  ?Museum a dbpedia-owl:Museum ; 
          dbpprop:name ?name ; 
          dbpedia-owl:abstract ?abstract ; 
          dbpedia-owl:thumbnail ?thumbnail ; 
          geo:lat ?latitude ;  
          geo:long ?longitude ; 
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ; 
          foaf:homepage ?homepage ; 
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN")) 
  FILTER (langMatches(lang(?name),"EN"))
}
LIMIT 20

SPARQL results

As anyone can see, the entries for Geffrye_Museum and Institute_for_Museum_Research are repeated in the results, because Institute_for_Museum_Research has two different values for its name and Geffrye_Museum has two longitude values. In both cases I want the second value to be discarded; i.e., for Geffrye_Museum the longitude value -0.0762194 must be ignored, and for Institute_for_Museum_Research the name value "Institut für Museumsforschung"@en must be ignored.

Note that I am already filtering the fields I want; this is simply an abundance of data in DBpedia that I want to handle at the query level. So how can I make DBpedia return only the first value when there are multiple values for the same column?


Answer 1:


Let's look at one case first. In the case of the Geffrye Museum, the duplicate results occur because multiple longitudes are present in the data, as the following query demonstrates:

SELECT ?museum ?latitude ?longitude
WHERE { 
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ; 
          geo:lat ?latitude ;  
          geo:long ?longitude .
}
GROUP BY ?museum ?latitude ?longitude

SPARQL results

which produces

museum                                     latitude longitude
http://dbpedia.org/resource/Geffrye_Museum 51.5317  -0.07663
http://dbpedia.org/resource/Geffrye_Museum 51.5317  -0.0762194
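
To see why one extra longitude doubles the row, here is a small Python sketch (not how the DBpedia endpoint is implemented) that models a graph pattern as the cross product of the values each property has for one subject. The `properties` dictionary and the `solutions` function are illustrative assumptions; the coordinate values are the ones in the table above.

```python
from itertools import product

# Hypothetical in-memory stand-in for the triples about one museum.
# The values are the latitude and longitudes shown in the results above.
properties = {
    "latitude": [51.5317],
    "longitude": [-0.07663, -0.0762194],
}

def solutions(props):
    """Return one row per combination of property values, like a join."""
    names = list(props)
    combos = product(*(props[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

rows = solutions(properties)
for row in rows:
    print(row)
```

Each additional value on any property multiplies the number of rows, so a museum with two names and two longitudes would appear four times.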

Fortunately, this is easy enough to remedy. As discussed in this question, you can group the results by their characteristic values, and then sample, minimize, maximize, etc., over the values to get precisely what you want. For instance, if you want the greatest valued longitude, you can use (MAX(?longitude) as ?longitude) in your SELECT, as in the following query, which produces a single value. (DBpedia's endpoint accepts reusing the name ?longitude for the aggregated result; a stricter SPARQL 1.1 processor may require a fresh name such as ?maxLongitude.)

SELECT ?museum ?latitude (MAX(?longitude) as ?longitude)
WHERE { 
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ; 
          geo:lat ?latitude ;  
          geo:long ?longitude .
}
GROUP BY ?museum ?latitude

SPARQL results

Of course, it presumes a bit of knowledge to group by ?latitude and to maximize over ?longitude. It's probably a better idea to just group by ?museum and use aggregate projection to pull out the other values, as in:

SELECT ?museum (MAX(?latitude) as ?latitude) (MAX(?longitude) as ?longitude)
WHERE { 
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ; 
          geo:lat ?latitude ;  
          geo:long ?longitude .
}
GROUP BY ?museum

SPARQL results
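
As a rough model of what GROUP BY ?museum with MAX does, the following Python sketch groups rows by museum and keeps the maximum of each remaining column. The `group_max` helper is a hypothetical name used only for illustration; the rows are the two results shown earlier for the Geffrye Museum.

```python
from collections import defaultdict

# Rows as the ungrouped query returns them (values from the results above).
rows = [
    {"museum": "Geffrye_Museum", "latitude": 51.5317, "longitude": -0.07663},
    {"museum": "Geffrye_Museum", "latitude": 51.5317, "longitude": -0.0762194},
]

def group_max(rows, key, columns):
    """Group rows by `key` and keep the maximum of each column per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return [
        {key: k, **{c: max(r[c] for r in members) for c in columns}}
        for k, members in groups.items()
    ]

grouped = group_max(rows, "museum", ["latitude", "longitude"])
print(grouped)
```

Note that MAX keeps -0.0762194, because it is the larger of the two negative numbers. If the goal is to keep -0.07663, as the question asks, MIN is the aggregate that does so for this data.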

Taking this approach to all the variables produces something like this:

SELECT DISTINCT ?Museum
  (SAMPLE(?name) as ?name)
  (SAMPLE(?abstract) as ?abstract)
  (SAMPLE(?thumbnail) as ?thumbnail)
  (MAX(?latitude) as ?latitude)
  (MAX(?longitude) as ?longitude)
  (SAMPLE(?photoCollection) as ?photoCollection)
  (SAMPLE(?website) as ?website)
  (SAMPLE(?homepage) as ?homepage)
  (SAMPLE(?wikilink) as ?wikilink)
WHERE { 
  ?Museum a dbpedia-owl:Museum ; 
          dbpprop:name ?name ; 
          dbpedia-owl:abstract ?abstract ; 
          dbpedia-owl:thumbnail ?thumbnail ; 
          geo:lat ?latitude ;  
          geo:long ?longitude ; 
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ; 
          foaf:homepage ?homepage ; 
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN")) 
  FILTER (langMatches(lang(?name),"EN"))
}
GROUP BY ?Museum
LIMIT 20

SPARQL results

It might seem a bit awkward to have to use the aggregate projection on all your variables, but it will work. However, you can also do the aggregation in a subquery first, and that will clean the variable projections up, at the cost of a subquery. (The subquery doesn't necessarily have a negative impact on the query; in fact it could be the opposite. The query itself is a bit harder to read, though.)

SELECT * WHERE { 
  # Select museums and a single latitude and longitude for them.
  {
    SELECT ?Museum (MAX(?longitude) as ?longitude) (MAX(?latitude) as ?latitude) WHERE {
      ?Museum a dbpedia-owl:Museum ;
              geo:lat ?latitude ;
              geo:long ?longitude .
    }
    GROUP BY ?Museum
  }
  # Get the rest of the properties of the museum.
  ?Museum dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ; 
          dbpedia-owl:thumbnail ?thumbnail ; 
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ; 
          foaf:homepage ?homepage ; 
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN")) 
  FILTER (langMatches(lang(?name),"EN"))
}
LIMIT 20

SPARQL results
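
The subquery pattern can be pictured as two steps: aggregate the coordinates per museum, then join the result with the remaining properties. The Python sketch below is only a model of that idea; the museum key "M1" and the two name strings are placeholders, not DBpedia data, and the function names are assumptions made for the example.

```python
# Hypothetical rows: "M1", "Name A" and "Name B" are placeholders.
coords = [
    {"museum": "M1", "latitude": 51.5317, "longitude": -0.07663},
    {"museum": "M1", "latitude": 51.5317, "longitude": -0.0762194},
]
names = [
    {"museum": "M1", "name": "Name A"},
    {"museum": "M1", "name": "Name B"},
]

def aggregate_coords(rows):
    """Inner step: one (max latitude, max longitude) pair per museum."""
    best = {}
    for row in rows:
        lat, lon = best.get(row["museum"], (row["latitude"], row["longitude"]))
        best[row["museum"]] = (max(lat, row["latitude"]), max(lon, row["longitude"]))
    return best

def join_names(best, names):
    """Outer step: join each aggregated museum with every name it has."""
    return [
        {
            "museum": n["museum"],
            "name": n["name"],
            "latitude": best[n["museum"]][0],
            "longitude": best[n["museum"]][1],
        }
        for n in names
        if n["museum"] in best
    ]

result = join_names(aggregate_coords(coords), names)
for row in result:
    print(row)
```

The coordinates are now single-valued, but two rows still come back, one per name, because the name was joined outside the aggregation.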

Finally, since you need to normalize over names as well as geographic coordinates, your final query would be something like the following. In your question, you only said that you wanted to keep the “first result,” but there's no particular order imposed on the results, so there is no unique “first result.” With the data at hand, you can use (MIN(?name) as ?name) and you'll get the name you wanted for the Institute for Museum Research, but if you have a particular constraint in mind, you'll need to figure out how to make that more specific.

SELECT * WHERE { 
  # Select museums and a single latitude, longitude, and name for them.
  {
    SELECT ?Museum 
           (MIN(?name) as ?name)
           (MAX(?longitude) as ?longitude)
           (MAX(?latitude) as ?latitude)
    WHERE {
      ?Museum a dbpedia-owl:Museum ;
              dbpprop:name ?name ;
              geo:lat ?latitude ;
              geo:long ?longitude .
      FILTER (langMatches(lang(?name),"EN"))
    }
    GROUP BY ?Museum
  }
  # Get the rest of the properties of the museum.
  ?Museum dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ; 
          dbpedia-owl:thumbnail ?thumbnail ; 
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ; 
          foaf:homepage ?homepage ; 
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN")) 
}
LIMIT 20

SPARQL results



Source: https://stackoverflow.com/questions/17174439/dbpedia-sparql-query-returns-multiple-and-duplicate-records
