How can I get the principal image from MediaWiki API?

我与影子孤独终老i 提交于 2019-12-04 07:16:59

As others have noted, Wikipedia articles don't really have any such thing as a "principal image", so your first problem will be deciding how to choose between the different images used on a given page. Some possible selection criteria might be:

  • Biggest image in the article.
  • First image exceeding some specific minimum dimensions, e.g. 60 × 60 pixels.
  • First image referenced directly in the article's source text, rather than through a template.

For the first two options, you'll want to fetch the rendered HTML code of the page via action=parse and use an HTML parser to find the img tags in the code, like this:

http://en.wikipedia.org/w/api.php?action=parse&page=English_language&prop=text|images

(The reason you can't just get the sizes of the images, as used on the page, directly from the API is that that information isn't actually stored anywhere in the MediaWiki database.)


For the last option, what you want is the source wikitext of the article, available via prop=revisions with rvprop=content:

http://en.wikipedia.org/w/api.php?action=query&titles=English_language&prop=revisions|images&rvprop=content

Note that many images in infoboxes and such are specified as parameters to a template, so just parsing for [[Image:...]] syntax will miss some of them. A better solution is probably to just get the list of all images used on the page via prop=images (which you can do in the same query, as I showed above) and look for their names (with or without Image: / File: prefix) in the wikitext.

Keep in mind the various ways in which MediaWiki automatically normalizes page (and image) names: most notably, underscores are mapped to spaces, consecutive whitespace is collapsed to a single space and the first letter of the name is capitalized. If you decide to go this way, here's some sample PHP code that will convert a list of file names into a regexp that should match any of them in wikitext:

foreach ($names as &$name) {
    $name = trim( preg_replace( '/[_\s]+/u', ' ', $name ) );
    $name = preg_quote( $name, '/' );
    $name = preg_replace( '/^(\\\\?.)/us', '(?i:$1)', $name );
    $name = preg_replace( '/\\\\? /u', '[_\s]+', $name );
}
$regexp = '/' . implode( '|', $names ) . '/u';

For example, when given the list:

Anglospeak(800px)Countries.png
Anglospeak.svg
Circle frame.svg
Commons-logo.svg
Flag of Argentina.svg
Flag of Aruba.svg

the generated regexp will be:

/(?i:A)nglospeak\(800px\)Countries\.png|(?i:A)nglospeak\.svg|(?i:C)ircle[_\s]+frame\.svg|(?i:C)ommons\-logo\.svg|(?i:F)lag[_\s]+of[_\s]+Argentina\.svg|(?i:F)lag[_\s]+of[_\s]+Aruba\.svg/u

There's news! (from 2014)
A new extension, PageImages, is available and also got already installed on the Wikimedia wikis.

Instead of prop=images, use prop=pageimages, and you'll get a pageimage attribute and a <thumbnail> child node for each <page> element.

Admittedly, it's not guaranteed to give the best results, but in your example (English Language) it works well and only yields the map of the geographic distribution, not all the flags.


Also, the OpenSearch API does return an <image> in it's xml representation, but this API is not usable with lists and cannot be combine with the Query API.

This is how I got it working...

$.getJSON("http://en.wikipedia.org/w/api.php?action=query&format=json&callback=?", {
    titles: "India",
    prop: "pageimages",
    pithumbsize: 150
  },
  function(data) {
    var source = "";
    var imageUrl = GetAttributeValue(data.query.pages);
    if (imageUrl == "") {
      $("#wiki").append("<div>No image found</div>");
    } else {
      var img = "<img src=\"" + imageUrl + "\">"
      $("#wiki").append(img);
    }
  }
);

 function GetAttributeValue(data) {
  var urli = "";
  for (var key in data) {
    if (data[key].thumbnail != undefined) {
      if (data[key].thumbnail.source != undefined) {
        urli = data[key].thumbnail.source;
        break;
      }
    }
  }
  return urli;
}



<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<html>

<head></head>

<body>
  <div id="wiki"></div>
</body>

</html>

Important addendum

Bergi's answer, above, seemed super great, but I was bashing my head out because I couldn't get it to work.

I needed to include pilicense=any in my query, because otherwise any copyrighted imagery was ignored.

Here's the query I ultimately got working:

https://en.wikipedia.org/w/api.php?action=query&pilicense=any&format=jsonfm&prop=pageimages&generator=search&gsrsearch=My+incategory:English-language_films+prefix:My&gsrlimit=3

I know it's been awhile, but this is one of the first pages I landed on when I started my days-long search for how to do this, so I wanted to share this specifically on this page, for others like me who might come here.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!