Hello, I'm using cURL to get information from Wikipedia, and I only want information about an article's principal image, not all of its images. For example, to get info about the images in the English language article (http://en.wikipedia.org/wiki/English_language) I go to this URL: http://en.wikipedia.org/w/api.php?action=query&titles=English_Language&prop=images but I receive the flags of all the countries where English is spoken, in XML:
<?xml version="1.0"?>
<api>
  <query>
    <normalized>
      <n from="English_language" to="English language" />
    </normalized>
    <pages>
      <page pageid="8569916" ns="0" title="English language">
        <images>
          <im ns="6" title="File:Anglospeak(800px)Countries.png" />
          <im ns="6" title="File:Anglospeak.svg" />
          <im ns="6" title="File:Circle frame.svg" />
          <im ns="6" title="File:Commons-logo.svg" />
          <im ns="6" title="File:Flag of Argentina.svg" />
          <im ns="6" title="File:Flag of Aruba.svg" />
          <im ns="6" title="File:Flag of Australia.svg" />
          <im ns="6" title="File:Flag of Bolivia.svg" />
          <im ns="6" title="File:Flag of Brazil.svg" />
          <im ns="6" title="File:Flag of Canada.svg" />
I only want the information about the principal image.
As others have noted, Wikipedia articles don't really have any such thing as a "principal image", so your first problem will be deciding how to choose between the different images used on a given page. Some possible selection criteria might be:
- Biggest image in the article.
- First image exceeding some specific minimum dimensions, e.g. 60 × 60 pixels.
- First image referenced directly in the article's source text, rather than through a template.
For the first two options, you'll want to fetch the rendered HTML code of the page via action=parse and use an HTML parser to find the img tags in it, like this:
http://en.wikipedia.org/w/api.php?action=parse&page=English_language&prop=text|images
(The reason you can't just get the sizes of the images, as used on the page, directly from the API is that that information isn't actually stored anywhere in the MediaWiki database.)
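For the "biggest image" criterion, something along these lines should work in PHP. This is a rough sketch, not part of the original answer: it assumes the rendered HTML from action=parse carries width/height attributes on its img tags (which Wikipedia's parser output normally does), and the User-Agent string and selection heuristic are just examples.

<?php
// Fetch the rendered HTML of the article via action=parse.
$url = 'https://en.wikipedia.org/w/api.php?action=parse'
     . '&page=English_language&prop=text&format=json';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'PrincipalImageDemo/1.0');  // example UA string
$json = curl_exec($ch);
curl_close($ch);

$data = json_decode($json, true);
$html = $data['parse']['text']['*'];   // rendered HTML lives under parse.text.*

// Parse the HTML and look for the <img> with the largest rendered area.
$doc = new DOMDocument();
libxml_use_internal_errors(true);      // the API returns a fragment, not a full document
$doc->loadHTML($html);
libxml_clear_errors();

$best = null;
$bestArea = 0;
foreach ($doc->getElementsByTagName('img') as $img) {
    $area = (int)$img->getAttribute('width') * (int)$img->getAttribute('height');
    if ($area > $bestArea) {
        $bestArea = $area;
        $best = $img->getAttribute('src');
    }
}

echo $best, "\n";   // URL of the largest image found on the page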
For the last option, what you want is the source wikitext of the article, available via prop=revisions with rvprop=content:
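For instance, mirroring the query from the question (the title is just the example article), something like this returns the raw wikitext:
http://en.wikipedia.org/w/api.php?action=query&titles=English_language&prop=revisions&rvprop=content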
Note that many images in infoboxes and such are specified as parameters to a template, so just parsing for [[Image:...]] syntax will miss some of them. A better solution is probably to just get the list of all images used on the page via prop=images (which you can do in the same query, as I showed above) and look for their names (with or without the Image: / File: prefix) in the wikitext.
Keep in mind the various ways in which MediaWiki automatically normalizes page (and image) names: most notably, underscores are mapped to spaces, consecutive whitespace is collapsed to a single space and the first letter of the name is capitalized. If you decide to go this way, here's some sample PHP code that will convert a list of file names into a regexp that should match any of them in wikitext:
foreach ($names as &$name) {
    // Collapse underscores and runs of whitespace into single spaces.
    $name = trim( preg_replace( '/[_\s]+/u', ' ', $name ) );
    // Escape any regexp metacharacters in the file name.
    $name = preg_quote( $name, '/' );
    // Make the first letter case-insensitive, since MediaWiki capitalizes it.
    $name = preg_replace( '/^(\\\\?.)/us', '(?i:$1)', $name );
    // Let each space match either spaces or underscores in the wikitext.
    $name = preg_replace( '/\\\\? /u', '[_\s]+', $name );
}
$regexp = '/' . implode( '|', $names ) . '/u';
For example, when given the list:
Anglospeak(800px)Countries.png
Anglospeak.svg
Circle frame.svg
Commons-logo.svg
Flag of Argentina.svg
Flag of Aruba.svg
the generated regexp will be:
/(?i:A)nglospeak\(800px\)Countries\.png|(?i:A)nglospeak\.svg|(?i:C)ircle[_\s]+frame\.svg|(?i:C)ommons\-logo\.svg|(?i:F)lag[_\s]+of[_\s]+Argentina\.svg|(?i:F)lag[_\s]+of[_\s]+Aruba\.svg/u
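To tie this together for the "first image referenced directly" criterion, here is a rough sketch (not from the original answer) of applying the generated regexp to the wikitext. It reuses the prop=revisions query shown above; $regexp comes from the loop above, and the User-Agent string is just an example.

<?php
// Fetch the raw wikitext of the article (same prop=revisions query as above).
$url = 'https://en.wikipedia.org/w/api.php?action=query&titles=English_language'
     . '&prop=revisions&rvprop=content&format=json';
$ctx  = stream_context_create(['http' => ['user_agent' => 'PrincipalImageDemo/1.0']]);
$json = file_get_contents($url, false, $ctx);
$data = json_decode($json, true);

$page     = reset($data['query']['pages']);   // only one page in this result
$wikitext = $page['revisions'][0]['*'];       // legacy JSON format keeps the text under '*'

// The first listed file name that occurs in the wikitext is one possible
// guess for the "principal" image.
if (preg_match($regexp, $wikitext, $m, PREG_OFFSET_CAPTURE)) {
    echo 'First image referenced: ', $m[0][0], ' (offset ', $m[0][1], ")\n";
} else {
    echo "No listed image found in the wikitext\n";
}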
There's news! (from 2014)
A new extension, PageImages, is available and has already been installed on the Wikimedia wikis.
Instead of prop=images, use prop=pageimages, and you'll get a pageimage attribute and a <thumbnail> child node for each <page> element.
Admittedly, it's not guaranteed to give the best results, but in your example (English Language) it works well and only yields the map of the geographic distribution, not all the flags.
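For the example article, a query of roughly this shape should do (the pithumbsize value is just an illustration and only controls the thumbnail width):
http://en.wikipedia.org/w/api.php?action=query&titles=English_language&prop=pageimages&pithumbsize=300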
Also, the OpenSearch API does return an <image> in its XML representation, but that API is not usable with lists and cannot be combined with the Query API.
This is how I got it working...
$.getJSON("http://en.wikipedia.org/w/api.php?action=query&format=json&callback=?", {
    titles: "India",
    prop: "pageimages",
    pithumbsize: 150
}, function(data) {
    var imageUrl = GetAttributeValue(data.query.pages);
    if (imageUrl == "") {
        $("#wiki").append("<div>No image found</div>");
    } else {
        var img = "<img src=\"" + imageUrl + "\">";
        $("#wiki").append(img);
    }
});

// Walk the pages object (keyed by page ID) and return the first thumbnail URL found.
function GetAttributeValue(data) {
    var urli = "";
    for (var key in data) {
        if (data[key].thumbnail != undefined) {
            if (data[key].thumbnail.source != undefined) {
                urli = data[key].thumbnail.source;
                break;
            }
        }
    }
    return urli;
}
<html>
  <head>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
  </head>
  <body>
    <div id="wiki"></div>
  </body>
</html>
You can limit your query to the first image in the article with the imlimit parameter:
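For example (an illustrative query, not the answer's original one):
http://en.wikipedia.org/w/api.php?action=query&titles=English_language&prop=images&imlimit=1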
Important addendum
Bergi's answer, above, seemed super great, but I was banging my head against the wall because I couldn't get it to work.
I needed to include pilicense=any in my query, because otherwise any copyrighted imagery was ignored.
Here's the query I ultimately got working:
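(The exact query was not preserved here; a reconstruction for illustration, with the title and thumbnail size as placeholders, would look roughly like this:)
http://en.wikipedia.org/w/api.php?action=query&titles=English_language&prop=pageimages&pilicense=any&pithumbsize=150&format=json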
I know it's been a while, but this is one of the first pages I landed on when I started my days-long search for how to do this, so I wanted to share it here specifically, for others like me who might come along later.
Source: https://stackoverflow.com/questions/12147886/how-can-i-get-the-principal-image-from-mediawiki-api