Question
I am quite new to SPARQL and am also confused by the many syntax standards that exist for it. I am struggling to fetch unique rows from DBpedia using the following query:
SELECT DISTINCT ?Museum, ?name, ?abstract, ?thumbnail, ?latitude,
                ?longitude, ?photoCollection, ?website, ?homepage, ?wikilink
WHERE {
  ?Museum a dbpedia-owl:Museum ;
          dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ;
          dbpedia-owl:thumbnail ?thumbnail ;
          geo:lat ?latitude ;
          geo:long ?longitude ;
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ;
          foaf:homepage ?homepage ;
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN"))
  FILTER (langMatches(lang(?name),"EN"))
}
LIMIT 20
As you can see in the results, the entries for Geffrye_Museum and Institute_for_Museum_Research are repeated because Institute_for_Museum_Research has two different values for its name and Geffrye_Museum has two longitude values. In both of these duplicate cases, I want the second value to be discarded; i.e., for Geffrye_Museum the longitude value -0.0762194 must be ignored, and for Institute_for_Museum_Research the name value "Institut für Museumsforschung"@en must be ignored.
Note that I am already filtering on the fields I want; this is simply an abundance of data in DBpedia that I want to handle at the query level. So how can I make DBpedia return only the first value when there are multiple values for the same column?
Answer 1:
Let's look at one case first. In the case of the Geffrye the duplicate results occur because multiple longitudes are present in the data, as the following query demonstrates:
SELECT ?museum ?latitude ?longitude
WHERE {
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ;
          geo:lat ?latitude ;
          geo:long ?longitude .
}
GROUP BY ?museum ?latitude ?longitude
which produces
museum                                       latitude   longitude
http://dbpedia.org/resource/Geffrye_Museum   51.5317    -0.07663
http://dbpedia.org/resource/Geffrye_Museum   51.5317    -0.0762194
Fortunately, this is easy enough to remedy. As discussed in this question, you can group the results by their characteristic values and then sample, minimize, maximize, etc., over the values to get precisely what you want. For instance, if you want the greatest longitude, you can use (MAX(?longitude) as ?longitude) in your SELECT, as in the following query, which produces a single row.
SELECT ?museum ?latitude (MAX(?longitude) as ?longitude)
WHERE {
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ;
          geo:lat ?latitude ;
          geo:long ?longitude .
}
GROUP BY ?museum ?latitude
Of course, it presumes a bit of knowledge to group by ?latitude and to maximize over ?longitude. It's probably a better idea to just group by ?museum and use aggregate projection to pull out the other values, as in:
SELECT ?museum (MAX(?latitude) as ?latitude) (MAX(?longitude) as ?longitude)
WHERE {
  VALUES ?museum { dbpedia:Geffrye_Museum }
  ?museum a dbpedia-owl:Museum ;
          geo:lat ?latitude ;
          geo:long ?longitude .
}
GROUP BY ?museum
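To make the effect of grouping by ?museum and projecting aggregates concrete, here is a minimal Python sketch that mimics it outside SPARQL on the two Geffrye rows shown above. The helper name group_max is hypothetical, and the sketch assumes the coordinates compare as plain numbers; it only illustrates the semantics, not how the endpoint evaluates the query.

```python
from collections import defaultdict

# The two result rows listed earlier for the Geffrye Museum.
rows = [
    {"museum": "http://dbpedia.org/resource/Geffrye_Museum",
     "latitude": 51.5317, "longitude": -0.07663},
    {"museum": "http://dbpedia.org/resource/Geffrye_Museum",
     "latitude": 51.5317, "longitude": -0.0762194},
]


def group_max(rows, key, value_columns):
    """Mimic GROUP BY ?key with MAX(...) projected for every other column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return [
        {key: k, **{col: max(r[col] for r in members) for col in value_columns}}
        for k, members in groups.items()
    ]


collapsed = group_max(rows, "museum", ["latitude", "longitude"])
print(collapsed)  # one row per museum
```

Note that with these two values MAX keeps -0.0762194, since it is the numerically larger one; if -0.07663 is the value to keep, MIN would be the aggregate to use.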
Taking this approach to all the variables produces something like this:
SELECT DISTINCT ?Museum
       (SAMPLE(?name) as ?name)
       (SAMPLE(?abstract) as ?abstract)
       (SAMPLE(?thumbnail) as ?thumbnail)
       (MAX(?latitude) as ?latitude)
       (MAX(?longitude) as ?longitude)
       (SAMPLE(?photoCollection) as ?photoCollection)
       (SAMPLE(?website) as ?website)
       (SAMPLE(?homepage) as ?homepage)
       (SAMPLE(?wikilink) as ?wikilink)
WHERE {
  ?Museum a dbpedia-owl:Museum ;
          dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ;
          dbpedia-owl:thumbnail ?thumbnail ;
          geo:lat ?latitude ;
          geo:long ?longitude ;
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ;
          foaf:homepage ?homepage ;
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN"))
  FILTER (langMatches(lang(?name),"EN"))
}
GROUP BY ?Museum
LIMIT 20
It might seem a bit awkward to have to use the aggregate projection on all your variables, but it will work. However, you can also do the aggregation in a subquery first, and that will clean the variable projections up, at the cost of a subquery. (The subquery doesn't necessarily have a negative impact on the query; in fact it could be the opposite. The query itself is a bit harder to read, though.)
SELECT * WHERE {
  # Select museums and a single latitude and longitude for them.
  {
    SELECT ?Museum (MAX(?longitude) as ?longitude) (MAX(?latitude) as ?latitude) WHERE {
      ?Museum a dbpedia-owl:Museum ;
              geo:lat ?latitude ;
              geo:long ?longitude .
    }
    GROUP BY ?Museum
  }
  # Get the rest of the properties of the museum.
  ?Museum dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ;
          dbpedia-owl:thumbnail ?thumbnail ;
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ;
          foaf:homepage ?homepage ;
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN"))
  FILTER (langMatches(lang(?name),"EN"))
}
LIMIT 20
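The two-step shape of this query (aggregate the coordinates first, then join in the remaining properties) can be sketched in Python as follows. The identifiers m1 and m2 and the name strings are hypothetical placeholder data; only the two longitudes for m1 are taken from the Geffrye rows above. The sketch also shows why duplicates can remain: a museum with two names still yields two joined rows.

```python
# Hypothetical input rows: (museum, latitude, longitude) and (museum, name).
coords = [
    ("m1", 51.5317, -0.07663),
    ("m1", 51.5317, -0.0762194),
    ("m2", 52.0, 13.0),
]
names = [("m1", "Name of m1"), ("m2", "Name A"), ("m2", "Name B")]

# Step 1 (the subquery): keep one (latitude, longitude) pair per museum,
# taking the maximum of each column independently, like MAX(...) does.
best = {}
for museum, lat, lon in coords:
    old = best.get(museum)
    best[museum] = (lat, lon) if old is None else (max(old[0], lat), max(old[1], lon))

# Step 2 (the outer pattern): join the remaining properties onto that result.
joined = [(m, n, *best[m]) for m, n in names if m in best]

for row in joined:
    print(row)
```

Here m1 collapses to a single row, while m2 still appears twice because its names were not aggregated.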
Finally, since you need to normalize over names as well as geographic coordinates, your final query would be something like the following. In your question, you only said that you wanted to keep the “first result,” but there's no particular order imposed on the results, so there is no unique “first result.” With the data at hand, you can use (MIN(?name) as ?name) and you'll get the name you wanted for the Institute for Museum Research, but if you have a particular constraint in mind, you'll need to figure out how to make that more specific.
SELECT * WHERE {
  # Select museums and a single latitude, longitude, and name for them.
  {
    SELECT ?Museum
           (MIN(?name) as ?name)
           (MAX(?longitude) as ?longitude)
           (MAX(?latitude) as ?latitude)
    WHERE {
      ?Museum a dbpedia-owl:Museum ;
              dbpprop:name ?name ;
              geo:lat ?latitude ;
              geo:long ?longitude .
      FILTER (langMatches(lang(?name),"EN"))
    }
    GROUP BY ?Museum
  }
  # Get the rest of the properties of the museum.
  ?Museum dbpprop:name ?name ;
          dbpedia-owl:abstract ?abstract ;
          dbpedia-owl:thumbnail ?thumbnail ;
          dbpprop:hasPhotoCollection ?photoCollection ;
          dbpprop:website ?website ;
          foaf:homepage ?homepage ;
          foaf:isPrimaryTopicOf ?wikilink .
  FILTER(langMatches(lang(?abstract),"EN"))
}
LIMIT 20
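The point about there being no unique "first result" can be illustrated with a small Python sketch, assuming the two Geffrye longitudes from above and treating each permutation as one possible order in which an endpoint might return the rows. Taking the first row depends on that order, whereas an aggregate such as MIN does not.

```python
from itertools import permutations

# The two longitude values reported for the Geffrye Museum.
longitudes = [-0.07663, -0.0762194]

# Every order in which the rows could come back.
orders = list(permutations(longitudes))

# "First value" changes with the row order ...
firsts = {order[0] for order in orders}
# ... while MIN gives the same answer for every order.
minimums = {min(order) for order in orders}

print(len(firsts), len(minimums))
```

This is why the queries above pick a value with MIN, MAX, or SAMPLE rather than relying on row position.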
Source: https://stackoverflow.com/questions/17174439/dbpedia-sparql-query-returns-multiple-and-duplicate-records