Parse birth and death dates from Wikipedia?

Question

I'm trying to write a Python program that can search Wikipedia for the birth and death dates of people.

For example, Albert Einstein was born: 14 March 1879; died: 18 April 1955.

I started with the code from "Fetch a Wikipedia article with Python":

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml')
page2 = infile.read()

This works as far as it goes. page2 is the XML representation of the first section of Albert Einstein's Wikipedia page.

Now that I have the page in XML format, I looked at this tutorial: http://www.travisglines.com/web-coding/python-xml-parser-tutorial. However, I don't understand how to get the information I want (birth and death dates) out of the XML. I feel like I must be close, and yet I have no idea how to proceed from here.

EDIT

After a few responses, I've installed BeautifulSoup. I'm now at the stage where I can print:

import BeautifulSoup as BS
soup = BS.BeautifulSoup(page2)
print soup.getText()
{{Infobox scientist
| name        = Albert Einstein
| image       = Einstein 1921 portrait2.jpg
| caption     = Albert Einstein in 1921
| birth_date  = {{Birth date|df=yes|1879|3|14}}
| birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
| death_date  = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
| death_place = [[Princeton, New Jersey|Princeton]], New Jersey, United States
| spouse      = [[Mileva Marić]]&amp;nbsp;(1903–1919)&lt;br&gt;{{nowrap|[[Elsa Löwenthal]]&amp;nbsp;(1919–1936)}}
| residence   = Germany, Italy, Switzerland, Austria, Belgium, United Kingdom, United States
| citizenship = {{Plainlist|
* [[Kingdom of Württemberg|Württemberg/Germany]] (1879–1896)
* [[Statelessness|Stateless]] (1896–1901)
* [[Switzerland]] (1901–1955)
* [[Austria–Hungary|Austria]] (1911–1912)
* [[German Empire|Germany]] (1914–1933)
* United States (1940–1955)
}}

So, much closer, but I still don't know how to return the death_date in this format. Unless I start parsing things with re? I can do that, but I feel like I'd be using the wrong tool for this job.

Answer 1:

You can consider using a library such as BeautifulSoup or lxml to parse the HTML/XML response.

You may also want to take a look at Requests, which has a much cleaner API for making requests.

Here is working code using Requests, BeautifulSoup and re. It is arguably not the best solution here, but it is quite flexible and can be extended to similar problems:

import re
import requests
from bs4 import BeautifulSoup

url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml'

res = requests.get(url)
soup = BeautifulSoup(res.text, "xml")

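# group(2) holds the parameters without the trailing "}}", e.g. "|df=yes|1879|3|14",
# so the last field is "14" rather than "14}}".
birth_re = re.search(r'(Birth date(.*?)}})', soup.revisions.getText())
birth_data = birth_re.group(2).split('|')
birth_year = birth_data[2]
birth_month = birth_data[3]
birth_day = birth_data[4]

# "Death date" also matches the "Death date and age" template.
death_re = re.search(r'(Death date(.*?)}})', soup.revisions.getText())
death_data = death_re.group(2).split('|')
death_year = death_data[2]
death_month = death_data[3]
death_day = death_data[4]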
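
The regex approach above can also be wrapped in a small helper that works on any wikitext string, without a network call. This is only a minimal sketch: the function name extract_date and the sample text are illustrative assumptions, and it only handles templates whose first three numeric positional parameters are year, month and day.

```python
import re

def extract_date(wikitext, template_prefix):
    """Return (year, month, day) from the first template whose name starts
    with template_prefix, or None if no such template is found."""
    pattern = r'\{\{\s*' + re.escape(template_prefix) + r'[^|}]*\|([^{}]*)\}\}'
    match = re.search(pattern, wikitext)
    if match is None:
        return None
    # Keep only purely numeric parameters; this skips named ones like "df=yes".
    numbers = [p.strip() for p in match.group(1).split('|') if p.strip().isdigit()]
    if len(numbers) < 3:
        return None
    year, month, day = (int(n) for n in numbers[:3])
    return year, month, day

# Sample lines copied from the infobox shown in the question.
sample = """| birth_date  = {{Birth date|df=yes|1879|3|14}}
| death_date  = {{Death date and age|df=yes|1955|4|18|1879|3|14}}"""

print(extract_date(sample, "Birth date"))   # (1879, 3, 14)
print(extract_date(sample, "Death date"))   # (1955, 4, 18)
```

Returning None instead of raising keeps the helper usable on articles that have no such template.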

Following @JBernardo's suggestion, here is a better answer for this particular use case, using JSON data and mwparserfromhell:

import requests
import mwparserfromhell

url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=json'

res = requests.get(url)
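# res.json() is a method in current versions of Requests; list() is needed
# because dict.values() is not indexable in Python 3.
text = list(res.json()["query"]["pages"].values())[0]["revisions"][0]["*"]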
wiki = mwparserfromhell.parse(text)

birth_data = wiki.filter_templates(matches="Birth date")[0]
birth_year = birth_data.get(1).value
birth_month = birth_data.get(2).value
birth_day = birth_data.get(3).value

death_data = wiki.filter_templates(matches="Death date")[0]
death_year = death_data.get(1).value
death_month = death_data.get(2).value
death_day = death_data.get(3).value
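
The extracted year, month and day are still separate string-like values. To get the format the question asks for (for example "14 March 1879"), they can be combined with the standard library. The following is a minimal sketch; format_date and MONTHS are hypothetical names introduced here, and a fixed month list is used instead of strftime so that the output does not depend on the locale.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def format_date(year, month, day):
    """Build a date from string-like parts and render it as 'D Month YYYY'."""
    # str() lets this accept plain strings as well as parser objects;
    # date() raises ValueError for impossible dates such as month 13.
    d = date(int(str(year).strip()), int(str(month).strip()), int(str(day).strip()))
    return "{} {} {}".format(d.day, MONTHS[d.month - 1], d.year)

print(format_date("1879", "3", "14"))  # 14 March 1879
print(format_date("1955", "4", "18"))  # 18 April 1955
```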

Answer 2:

First: The Wikipedia API allows the use of JSON instead of XML, and that will make things much easier.

Second: There's no need to use HTML/XML parsers at all (the content is not HTML, and the container doesn't need to be either). What you need to parse is the wiki markup inside the "revisions" element of the JSON.

Check some Wiki parsers here

What seems to be confusing here is that the API allows you to request a certain format (XML or JSON), but that is just a container for some text in the real format you want to parse:

This one: {{Birth date|df=yes|1879|3|14}}

With one of the parsers provided in the link above, you will be able to do that.
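
To illustrate the container/content distinction, here is a minimal sketch using only the standard library. The response below is a hand-built, trimmed-down stand-in for what format=json returns (the page id key and the helper names are assumptions for illustration), and the naive template_params function does not handle nested templates, which is exactly what a real wiki parser is for.

```python
import json

# Hypothetical, shortened API response: JSON is only the container.
raw = json.dumps({
    "query": {"pages": {"736": {
        "title": "Albert Einstein",
        "revisions": [{"*": "{{Infobox scientist\n"
                            "| birth_date  = {{Birth date|df=yes|1879|3|14}}\n}}"}],
    }}}
})

def wikitext_from_response(raw_json):
    """Step 1: unwrap the container and return the wiki markup string."""
    data = json.loads(raw_json)
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]

def template_params(wikitext, name):
    """Step 2: parse the wiki markup itself (naive, no nesting support)."""
    start = wikitext.find("{{" + name)
    if start == -1:
        return None
    end = wikitext.find("}}", start)
    if end == -1:
        return None
    # Drop the template name, keep the parameters.
    return [p.strip() for p in wikitext[start + 2:end].split("|")[1:]]

text = wikitext_from_response(raw)
print(template_params(text, "Birth date"))  # ['df=yes', '1879', '3', '14']
```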

Answer 3:

First, use pywikipedia. It allows you to query article text, template parameters etc. through a high-level abstract interface. Second, I would go with the Persondata template (look towards the end of the article). Also, in the long term, you might be interested in Wikidata, which will take several months to introduce, but it will make most metadata in Wikipedia articles easily queryable.

Answer 4:

The persondata template is deprecated now, and you should instead access Wikidata. See Wikidata:Data access. My earlier (now deprecated) answer from 2012 was as follows:

What you should do is parse the {{persondata}} template found in most biographical articles. There are existing tools for easily extracting such data programmatically. With your existing knowledge and the other helpful answers, I am sure you can make that work.

Answer 5:

One alternative in 2019 is to use the Wikidata API, which, among other things, exposes biographical data like birth and death dates in a structured format that is very easy to consume without any custom parsers. Many Wikipedia articles depend on Wikidata for their info, so in many cases this will be the same as if you were consuming Wikipedia data.

For example, look at the Wikidata page for Albert Einstein and search for "date of birth" and "date of death"; you will find they are the same as in Wikipedia. Every entity in Wikidata has a list of "claims", which are pairs of "properties" and "values". To know when Einstein was born and died, we only need to search the list of claims for the appropriate properties, in this case P569 and P570. To do this programmatically, it's best to access the entity as JSON, which you can do with the following URL structure:

https://www.wikidata.org/wiki/Special:EntityData/Q937.json

And as an example, here is what the claim P569 states about Einstein:

        "P569": [
          {
            "mainsnak": {
              "property": "P569",
              "datavalue": {
                "value": {
                  "time": "+1879-03-14T00:00:00Z",
                  "timezone": 0,
                  "before": 0,
                  "after": 0,
                  "precision": 11,
                  "calendarmodel": "http://www.wikidata.org/entity/Q1985727"
                },
                "type": "time"
              },
              "datatype": "time"
            },
            "type": "statement",

You can learn more about accessing Wikidata in this article, and more specifically about how dates are structured in Help:Dates.
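
As a rough sketch of how such a claim could be consumed, the code below reads a date out of a hand-built dictionary that mirrors the structure shown above. The helper name claim_date and the sample data are assumptions for illustration; in the full EntityData response the entity is expected under ["entities"]["Q937"], and only day-precision values (precision 11) with positive years are handled here.

```python
from datetime import date

# Hand-built sample in the shape of the claim excerpt above (not a live response).
sample_entity = {
    "claims": {
        "P569": [{"mainsnak": {"property": "P569", "datavalue": {
            "value": {"time": "+1879-03-14T00:00:00Z", "precision": 11},
            "type": "time"}}}],
        "P570": [{"mainsnak": {"property": "P570", "datavalue": {
            "value": {"time": "+1955-04-18T00:00:00Z", "precision": 11},
            "type": "time"}}}],
    }
}

def claim_date(entity, prop):
    """Return the first day-precision date for a property, or None."""
    for statement in entity.get("claims", {}).get(prop, []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if not datavalue:
            continue  # the statement carries no concrete value
        value = datavalue["value"]
        time_str = value.get("time", "")
        # Skip year- or month-only values and negative (BCE) years.
        if value.get("precision") != 11 or not time_str.startswith("+"):
            continue
        year, month, day = time_str[1:].split("T")[0].split("-")
        return date(int(year), int(month), int(day))
    return None

print(claim_date(sample_entity, "P569"))  # 1879-03-14
print(claim_date(sample_entity, "P570"))  # 1955-04-18
```

Checking the precision field first avoids treating a value like "+1879-00-00T00:00:00Z" as a real calendar date.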

Source: https://stackoverflow.com/questions/12250580/parse-birth-and-death-dates-from-wikipedia

Tags

python

mediawiki

wikipedia

wikipedia-api

mediawiki-api