问题
For example using this Wikipedia dump:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm
Is there an existing library for Python that I can use to create an array with the mapping of subjects and values?
For example:
{height_ft,6},{nationality, American}
回答1:
It looks like you really want to be able to parse MediaWiki markup. There is a python library designed for this purpose called mwlib. You can use python's built-in XML packages to extract the page content from the API's response, then pass that content into mwlib's parser to produce an object representation that you can browse and analyse in code to extract the information you want. mwlib is BSD licensed.
回答2:
Just stumbled over a library on PyPi, wikidump, that claims to provide
Tools to manipulate and extract data from wikipedia dumps
I didn't use it yet, so you are on your own to try it...
回答3:
I described how to do this using a combination of pywikibot and mwparserfromhell in this post (don't have enough reputation yet to flag as a duplicate).
In [1]: import mwparserfromhell
In [2]: import pywikibot
In [3]: enwp = pywikibot.Site('en','wikipedia')
In [4]: page = pywikibot.Page(enwp, 'Waking Life')
In [5]: wikitext = page.get()
In [6]: wikicode = mwparserfromhell.parse(wikitext)
In [7]: templates = wikicode.filter_templates()
In [8]: templates?
Type: list
String Form:[u'{{Use mdy dates|date=September 2012}}', u"{{Infobox film\n| name = Waking Life\n| im <...> critic film|waking-life|Waking Life}}', u'{{Richard Linklater}}', u'{{DEFAULTSORT:Waking Life}}']
Length: 31
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
In [10]: templates[:2]
Out[10]:
[u'{{Use mdy dates|date=September 2012}}',
u"{{Infobox film\n| name = Waking Life\n| image = Waking-Life-Poster.jpg\n| image_size = 220px\n| alt =\n| caption = Theatrical release poster\n| director = [[Richard Linklater]]\n| producer = [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West\n| writer = Richard Linklater\n| starring = [[Wiley Wiggins]]\n| music = Glover Gill\n| cinematography = Richard Linklater<br />[[Tommy Pallotta]]\n| editing = Sandra Adair\n| studio = [[Thousand Words]]\n| distributor = [[Fox Searchlight Pictures]]\n| released = {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}\n| runtime = 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>\n| country = United States\n| language = English\n| budget =\n| gross = $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>\n}}"]
In [11]: infobox_film = templates[1]
In [12]: for param in infobox_film.params:
print param.name, param.value
name Waking Life
image Waking-Life-Poster.jpg
image_size 220px
alt
caption Theatrical release poster
director [[Richard Linklater]]
producer [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West
writer Richard Linklater
starring [[Wiley Wiggins]]
music Glover Gill
cinematography Richard Linklater<br />[[Tommy Pallotta]]
editing Sandra Adair
studio [[Thousand Words]]
distributor [[Fox Searchlight Pictures]]
released {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}
runtime 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>
country United States
language English
budget
gross $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>
Don't forget that params are mwparserfromhell objects too!
回答4:
I know the question is old, but I was searching for a library that parses wikipedia xml dump. However, the suggested libraries, wikidump and mwlib, don't offer many code documentation. Then, I found Mediwiki-utilities, which has some code documentation in: http://pythonhosted.org/mediawiki-utilities/.
回答5:
WikiExtractor appears to be a clean, simple, and efficient way to do this in Python today: https://github.com/attardi/wikiextractor
It provides an easy way to parse a Wikipedia dump into a simple file structure like so:
<doc>...</doc>
<doc>...</doc>
...
<doc>...</doc>
...where each doc looks like:
<doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
...
</doc>
回答6:
There's some information on Python and XML libraries here.
If you're asking is there an existing library that's designed to parse Wiki(pedia) XML specifically and match your requirements, this is doubtful. However you can use one of the existing libraries to traverse the DOM and pull out the data you need.
Another option is to write an XSLT stylesheet that does similar and call it using lxml. This also lets you make calls to Python functions from inside the XSLT so you get the best of both worlds.
回答7:
You're probably looking for the Pywikipediabot for manipulating the wikipedia API.
回答8:
I would say look at using Beautiful Soup and just get the Wikipedia page in HTML instead of using the API.
I'll try and post an example.
来源:https://stackoverflow.com/questions/3463447/parsing-a-wikipedia-dump