I am trying to query the frequency of certain attributes in Wikidata, using SPARQL.
For example, to find out what the frequency of different values for gender is, I have the following query:
SELECT ?rid (COUNT(?rid) AS ?count)
WHERE { ?qid wdt:P21 ?rid.
BIND(wd:Q5 AS ?human)
?qid wdt:P31 ?human.
} GROUP BY ?rid
I get the following result:
wd:Q6581097 2752163
wd:Q6581072 562339
wd:Q1052281 223
wd:Q1097630 68
wd:Q2449503 67
wd:Q48270 36
wd:Q44148 8
wd:Q43445 4
t152990852 1
t152990762 1
t152990752 1
t152990635 1
t152775383 1
t152775370 1
t152775368 1
...
I have the following questions regarding this:
- What do those t152... values refer to?
- How can I ignore the tuples containing t152...? I tried FILTER ( !strstarts(str(?rid), "wd:") ) but it timed out.
- How can I count the distinct number of answers? I tried SELECT (COUNT(DISTINCT ?rid) AS ?count) with the above query, but again it timed out.
Values starting with t are "skolemized" unknown values (see, e.g., Q2423351 for a person of unknown sex or gender).
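To look at such unknown values directly, a query along the following lines could list a few of the affected items. This is only a sketch: it assumes the wd: and wdt: prefixes that the Wikidata Query Service predefines, the LIMIT is arbitrary, and it relies on unknown values not being IRIs. On an endpoint that exposes unknown values as skolem IRIs instead of blank nodes, the isURI test would not separate them, and a service-specific check (the Wikidata Query Service manual describes wikibase:isSomeValue() for this) would likely be needed instead.

```sparql
# Sketch: list a few humans whose sex or gender (P21) is an unknown value.
# Assumes unknown values are not IRIs on the endpoint being queried.
SELECT ?qid ?rid WHERE {
  ?qid wdt:P31 wd:Q5 .      # instance of: human
  ?qid wdt:P21 ?rid .       # sex or gender
  FILTER (!isURI(?rid))     # keep only non-IRI (unknown) values
}
LIMIT 10
```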
To improve performance, I suggest dividing your query into three parts:
All "normal" genders:
SELECT ?rid (COUNT(?qid) AS ?count) WHERE { ?qid wdt:P31 wd:Q5. ?qid wdt:P21 ?rid. ?rid wdt:P31 wd:Q48264 } GROUP BY ?rid ORDER BY DESC(?count)
Please note that, according to Wikidata, wd:Q746411 is a subclass of wd:Q48270, etc.
All "non-normal" genders:
SELECT ?rid (COUNT(?qid) AS ?count) WHERE { ?qid wdt:P31 wd:Q5. ?qid wdt:P21 ?rid. FILTER (?rid NOT IN ( wd:Q6581097, wd:Q6581072, wd:Q1052281, wd:Q2449503, wd:Q48270, wd:Q746411, wd:Q189125, wd:Q1399232, wd:Q3277905 ) ). FILTER (isURI(?rid)) } GROUP BY ?rid ORDER BY DESC(?count)
I do not use
FILTER NOT EXISTS {?rid wdt:P31 wd:Q48264 }
due to performance reasons.
All (i.e. 1) "unknown" genders:
SELECT (COUNT(?qid) AS ?count) WHERE { ?qid wdt:P31 wd:Q5. ?qid wdt:P21 ?rid. FILTER (!isURI(?rid)) }
In your case it makes little difference whether you count the wd:Q5 items with DISTINCT or without it, but counting without DISTINCT is preferable for performance reasons.
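For the third question (the number of distinct gender values), one possible approach is to wrap the grouped query in a subquery and count its rows, rather than using COUNT(DISTINCT ?rid) over all matching items. The following is a sketch under assumptions: the variable name ?distinctGenders is made up for illustration, the wd: and wdt: prefixes are the ones predefined by the Wikidata Query Service, and whether it finishes within the timeout has not been verified. The filter sits outside the subquery so that it is applied only to the already grouped values.

```sparql
# Sketch: count how many distinct, non-unknown gender values occur for humans.
SELECT (COUNT(?rid) AS ?distinctGenders) WHERE {
  {
    # Inner query: one row per gender value, as in the original GROUP BY query.
    SELECT ?rid WHERE {
      ?qid wdt:P31 wd:Q5 .
      ?qid wdt:P21 ?rid .
    }
    GROUP BY ?rid
  }
  # Drop unknown values; assumes they are not IRIs on this endpoint.
  FILTER (isURI(?rid))
}
```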
Source: https://stackoverflow.com/questions/44374813/what-are-values-starting-with-t-and-how-to-ignore-them-for-counting